1 Case study 1: Self-esteem

Self-esteem generally describes a person’s overall sense of self-worth and personal value. It can play a significant role in one’s motivation and success throughout life. Factors that influence self-esteem include inner thinking, health condition, age, and life experiences. We will try to identify factors in our data that are related to the level of self-esteem.

The well-cited National Longitudinal Survey of Youth (NLSY79) follows about 13,000 individuals, and extensive individual-year information has been gathered through surveys. The survey data is open to the public here. From the many variables available, we assembled a subset that includes personal demographic variables in different years, household environment in 1979, ASVAB test scores in 1981, and self-esteem scores in 1981 and 1987.

The data is stored in NLSY79.csv.

Here are the descriptions of the variables:

Personal Demographic Variables

  • Gender: a factor with levels “female” and “male”
  • Education05: years of education completed by 2005
  • HeightFeet05, HeightInch05: height measurement. For example, a person of 5’10 will be recorded as HeightFeet05=5, HeightInch05=10.
  • Weight05: weight in lbs.
  • Income87, Income05: total annual income from wages and salary in 1987 and 2005, respectively.
  • Job87 (missing), Job05: job type in 1987 and 2005, including Protective Service Occupations, Food Preparation and Serving Related Occupations, Cleaning and Building Service Occupations, Entertainment Attendants and Related Workers, Funeral Related Occupations, Personal Care and Service Workers, Sales and Related Workers, Office and Administrative Support Workers, Farming, Fishing and Forestry Occupations, Construction Trade and Extraction Workers, Installation, Maintenance and Repairs Workers, Production and Operating Workers, Food Preparation Occupations, Setters, Operators and Tenders, Transportation and Material Moving Workers

Household Environment

  • Imagazine: a variable taking on the value 1 if anyone in the respondent’s household regularly read magazines in 1979, otherwise 0
  • Inewspaper: a variable taking on the value 1 if anyone in the respondent’s household regularly read newspapers in 1979, otherwise 0
  • Ilibrary: a variable taking on the value 1 if anyone in the respondent’s household had a library card in 1979, otherwise 0
  • MotherEd: mother’s years of education
  • FatherEd: father’s years of education
  • FamilyIncome78

Variables Related to ASVAB test Scores in 1981

  • AFQT: percentile score on the AFQT intelligence test in 1981
  • Coding: score on the Coding Speed test in 1981
  • Auto: score on the Automotive and Shop test in 1981
  • Mechanic: score on the Mechanic test in 1981
  • Elec: score on the Electronics Information test in 1981
  • Science: score on the General Science test in 1981
  • Math: score on the Math test in 1981
  • Arith: score on the Arithmetic Reasoning test in 1981
  • Word: score on the Word Knowledge test in 1981
  • Parag: score on the Paragraph Comprehension test in 1981
  • Number: score on the Numerical Operations test in 1981

Self-Esteem test 81 and 87

We have two sets of self-esteem tests, one in 1981 and the other in 1987. Each set has the same 10 questions. They are labeled Esteem81 and Esteem87, respectively, followed by the question number. For example, Esteem81_1 is Esteem question 1 in 1981.

The following 10 questions are answered as 1: strongly agree, 2: agree, 3: disagree, 4: strongly disagree

  • Esteem 1: “I am a person of worth”
  • Esteem 2: “I have a number of good qualities”
  • Esteem 3: “I am inclined to feel like a failure”
  • Esteem 4: “I do things as well as others”
  • Esteem 5: “I do not have much to be proud of”
  • Esteem 6: “I take a positive attitude towards myself and others”
  • Esteem 7: “I am satisfied with myself”
  • Esteem 8: “I wish I could have more respect for myself”
  • Esteem 9: “I feel useless at times”
  • Esteem 10: “I think I am no good at all”

1.1 Data preparation

Load the data. Do a quick EDA to get familiar with the data set. Pay attention to the unit of each variable. Are there any missing values?

## 'data.frame':    2431 obs. of  46 variables:
##  $ Subject       : int  2 6 7 8 9 13 16 17 18 20 ...
##  $ Gender        : chr  "female" "male" "male" "female" ...
##  $ Education05   : int  12 16 12 14 14 16 13 13 13 17 ...
##  $ Income87      : int  16000 18000 0 9000 15000 2200 27000 20000 28000 27000 ...
##  $ Job05         : chr  "4700 TO 4960: Sales and Related Workers" "10 TO 430: Executive, Administrative and Managerial Occupations" "7900 TO 8960: Setters, Operators and Tenders" "5000 TO 5930: Office and Administrative Support Workers" ...
##  $ Income05      : int  5500 65000 19000 36000 65000 8000 71000 43000 120000 64000 ...
##  $ Weight05      : int  160 187 175 246 180 235 160 188 173 130 ...
##  $ HeightFeet05  : int  5 5 5 5 5 6 5 5 5 5 ...
##  $ HeightInch05  : int  2 5 9 3 6 0 4 10 9 4 ...
##  $ Imagazine     : int  1 0 1 1 1 1 1 1 1 1 ...
##  $ Inewspaper    : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ Ilibrary      : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ MotherEd      : int  5 12 12 9 12 12 12 12 12 12 ...
##  $ FatherEd      : int  8 12 12 6 10 16 12 15 16 18 ...
##  $ FamilyIncome78: int  20000 35000 8502 7227 17000 20000 48000 15000 4510 50000 ...
##  $ Science       : int  6 23 14 18 17 16 13 19 22 21 ...
##  $ Arith         : int  8 30 14 13 21 30 17 29 30 17 ...
##  $ Word          : int  15 35 27 35 28 29 30 33 35 28 ...
##  $ Parag         : int  6 15 8 12 10 13 12 13 14 14 ...
##  $ Number        : int  29 45 32 24 40 36 49 35 48 39 ...
##  $ Coding        : int  52 68 35 48 46 30 58 58 61 54 ...
##  $ Auto          : int  9 21 13 11 13 21 11 18 21 18 ...
##  $ Math          : int  6 23 11 4 13 24 17 21 23 20 ...
##  $ Mechanic      : int  10 21 9 12 13 19 11 19 16 20 ...
##  $ Elec          : int  5 19 11 12 15 16 10 16 17 13 ...
##  $ AFQT          : num  6.84 99.39 47.41 44.02 59.68 ...
##  $ Esteem81_1    : int  1 2 2 1 1 1 2 2 2 1 ...
##  $ Esteem81_2    : int  1 1 1 1 1 1 2 2 2 1 ...
##  $ Esteem81_3    : int  4 4 3 3 4 4 3 3 3 3 ...
##  $ Esteem81_4    : int  1 2 2 2 1 1 2 2 2 1 ...
##  $ Esteem81_5    : int  3 4 3 3 1 4 3 3 3 3 ...
##  $ Esteem81_6    : int  3 2 2 2 1 1 2 2 2 2 ...
##  $ Esteem81_7    : int  1 2 2 3 1 1 3 2 2 1 ...
##  $ Esteem81_8    : int  3 4 2 3 4 4 3 3 3 3 ...
##  $ Esteem81_9    : int  3 3 3 3 4 4 3 3 3 3 ...
##  $ Esteem81_10   : int  3 4 3 3 4 4 3 3 3 3 ...
##  $ Esteem87_1    : int  2 1 2 1 1 1 1 2 1 1 ...
##  $ Esteem87_2    : int  1 1 2 1 1 1 1 2 1 1 ...
##  $ Esteem87_3    : int  4 4 4 3 4 4 4 3 4 4 ...
##  $ Esteem87_4    : int  1 1 2 1 1 1 2 2 1 4 ...
##  $ Esteem87_5    : int  2 4 4 4 4 4 4 3 4 4 ...
##  $ Esteem87_6    : int  2 1 2 2 1 1 2 2 1 1 ...
##  $ Esteem87_7    : int  2 2 2 1 1 2 2 2 2 1 ...
##  $ Esteem87_8    : int  3 3 4 2 4 4 4 3 4 3 ...
##  $ Esteem87_9    : int  3 2 3 2 4 4 3 3 3 4 ...
##  $ Esteem87_10   : int  4 4 4 2 4 4 4 3 4 4 ...
## [1] FALSE
##     Subject         Gender           Education05      Income87    
##  Min.   :    2   Length:2431        Min.   : 6.0   Min.   :   -2  
##  1st Qu.: 1592   Class :character   1st Qu.:12.0   1st Qu.: 4500  
##  Median : 3137   Mode  :character   Median :13.0   Median :12000  
##  Mean   : 3504                      Mean   :13.9   Mean   :13399  
##  3rd Qu.: 4668                      3rd Qu.:16.0   3rd Qu.:19000  
##  Max.   :12140                      Max.   :20.0   Max.   :59387  
##     Job05              Income05         Weight05    HeightFeet05  
##  Length:2431        Min.   :    63   Min.   : 81   Min.   :-4.00  
##  Class :character   1st Qu.: 22650   1st Qu.:150   1st Qu.: 5.00  
##  Mode  :character   Median : 38500   Median :180   Median : 5.00  
##                     Mean   : 49415   Mean   :183   Mean   : 5.18  
##                     3rd Qu.: 61350   3rd Qu.:209   3rd Qu.: 5.00  
##                     Max.   :703637   Max.   :380   Max.   : 8.00  
##   HeightInch05     Imagazine       Inewspaper       Ilibrary       MotherEd   
##  Min.   : 0.00   Min.   :0.000   Min.   :0.000   Min.   :0.00   Min.   : 0.0  
##  1st Qu.: 2.00   1st Qu.:0.000   1st Qu.:1.000   1st Qu.:1.00   1st Qu.:11.0  
##  Median : 5.00   Median :1.000   Median :1.000   Median :1.00   Median :12.0  
##  Mean   : 5.32   Mean   :0.718   Mean   :0.861   Mean   :0.77   Mean   :11.7  
##  3rd Qu.: 8.00   3rd Qu.:1.000   3rd Qu.:1.000   3rd Qu.:1.00   3rd Qu.:12.0  
##  Max.   :11.00   Max.   :1.000   Max.   :1.000   Max.   :1.00   Max.   :20.0  
##     FatherEd    FamilyIncome78     Science         Arith           Word     
##  Min.   : 0.0   Min.   :    0   Min.   : 0.0   Min.   : 0.0   Min.   : 0.0  
##  1st Qu.:10.0   1st Qu.:11167   1st Qu.:13.0   1st Qu.:13.0   1st Qu.:23.0  
##  Median :12.0   Median :20000   Median :17.0   Median :19.0   Median :28.0  
##  Mean   :11.9   Mean   :21252   Mean   :16.3   Mean   :18.6   Mean   :26.6  
##  3rd Qu.:14.0   3rd Qu.:27500   3rd Qu.:20.0   3rd Qu.:25.0   3rd Qu.:32.0  
##  Max.   :20.0   Max.   :75001   Max.   :25.0   Max.   :30.0   Max.   :35.0  
##      Parag          Number         Coding          Auto           Math     
##  Min.   : 0.0   Min.   : 0.0   Min.   : 0.0   Min.   : 0.0   Min.   : 0.0  
##  1st Qu.:10.0   1st Qu.:29.0   1st Qu.:38.0   1st Qu.:10.0   1st Qu.: 9.0  
##  Median :12.0   Median :36.0   Median :48.0   Median :14.0   Median :14.0  
##  Mean   :11.2   Mean   :35.5   Mean   :47.1   Mean   :14.3   Mean   :14.3  
##  3rd Qu.:14.0   3rd Qu.:44.0   3rd Qu.:57.0   3rd Qu.:18.0   3rd Qu.:20.0  
##  Max.   :15.0   Max.   :50.0   Max.   :84.0   Max.   :25.0   Max.   :25.0  
##     Mechanic         Elec           AFQT         Esteem81_1     Esteem81_2  
##  Min.   : 0.0   Min.   : 0.0   Min.   :  0.0   Min.   :1.00   Min.   :1.00  
##  1st Qu.:11.0   1st Qu.: 9.0   1st Qu.: 31.9   1st Qu.:1.00   1st Qu.:1.00  
##  Median :14.0   Median :12.0   Median : 57.0   Median :1.00   Median :1.00  
##  Mean   :14.4   Mean   :11.6   Mean   : 54.7   Mean   :1.42   Mean   :1.42  
##  3rd Qu.:18.0   3rd Qu.:15.0   3rd Qu.: 78.2   3rd Qu.:2.00   3rd Qu.:2.00  
##  Max.   :25.0   Max.   :20.0   Max.   :100.0   Max.   :4.00   Max.   :4.00  
##    Esteem81_3     Esteem81_4     Esteem81_5     Esteem81_6     Esteem81_7  
##  Min.   :1.00   Min.   :1.00   Min.   :1.00   Min.   :1.00   Min.   :1.00  
##  1st Qu.:3.00   1st Qu.:1.00   1st Qu.:3.00   1st Qu.:1.00   1st Qu.:1.00  
##  Median :4.00   Median :2.00   Median :4.00   Median :2.00   Median :2.00  
##  Mean   :3.51   Mean   :1.57   Mean   :3.46   Mean   :1.62   Mean   :1.75  
##  3rd Qu.:4.00   3rd Qu.:2.00   3rd Qu.:4.00   3rd Qu.:2.00   3rd Qu.:2.00  
##  Max.   :4.00   Max.   :4.00   Max.   :4.00   Max.   :4.00   Max.   :4.00  
##    Esteem81_8     Esteem81_9    Esteem81_10    Esteem87_1     Esteem87_2 
##  Min.   :1.00   Min.   :1.00   Min.   :1.0   Min.   :1.00   Min.   :1.0  
##  1st Qu.:3.00   1st Qu.:3.00   1st Qu.:3.0   1st Qu.:1.00   1st Qu.:1.0  
##  Median :3.00   Median :3.00   Median :3.0   Median :1.00   Median :1.0  
##  Mean   :3.13   Mean   :3.16   Mean   :3.4   Mean   :1.38   Mean   :1.4  
##  3rd Qu.:4.00   3rd Qu.:4.00   3rd Qu.:4.0   3rd Qu.:2.00   3rd Qu.:2.0  
##  Max.   :4.00   Max.   :4.00   Max.   :4.0   Max.   :4.00   Max.   :4.0  
##    Esteem87_3     Esteem87_4    Esteem87_5     Esteem87_6     Esteem87_7  
##  Min.   :1.00   Min.   :1.0   Min.   :1.00   Min.   :1.00   Min.   :1.00  
##  1st Qu.:3.00   1st Qu.:1.0   1st Qu.:3.00   1st Qu.:1.00   1st Qu.:1.00  
##  Median :4.00   Median :1.0   Median :4.00   Median :2.00   Median :2.00  
##  Mean   :3.58   Mean   :1.5   Mean   :3.53   Mean   :1.59   Mean   :1.72  
##  3rd Qu.:4.00   3rd Qu.:2.0   3rd Qu.:4.00   3rd Qu.:2.00   3rd Qu.:2.00  
##  Max.   :4.00   Max.   :4.0   Max.   :4.00   Max.   :4.00   Max.   :4.00  
##    Esteem87_8    Esteem87_9    Esteem87_10  
##  Min.   :1.0   Min.   :1.00   Min.   :1.00  
##  1st Qu.:3.0   1st Qu.:3.00   1st Qu.:3.00  
##  Median :3.0   Median :3.00   Median :3.00  
##  Mean   :3.1   Mean   :3.06   Mean   :3.37  
##  3rd Qu.:4.0   3rd Qu.:4.00   3rd Qu.:4.00  
##  Max.   :4.0   Max.   :4.00   Max.   :4.00
##  [1] ""                                                                                   
##  [2] "10 TO 430: Executive, Administrative and Managerial Occupations"                    
##  [3] "1000 TO 1240: Mathematical and Computer Scientists"                                 
##  [4] "1300 TO 1560: Engineers, Architects, Surveyers, Engineering and Related Technicians"
##  [5] "1600 TO 1760: Physical Scientists"                                                  
##  [6] "1800 TO 1860: Social Scientists and Related Workers"                                
##  [7] "1900 TO 1960: Life, Physical and Social Science Technicians"                        
##  [8] "2000 TO 2060: Counselors, Sociala and Religious Workers"                            
##  [9] "2100 TO 2150: Lawyers, Judges and Legal Support Workers"                            
## [10] "2200 TO 2340: Teachers"                                                             
## [11] "2400 TO 2550: Education, Training and Library Workers"                              
## [12] "2600 TO 2760: Entertainers and Performers, Sports and Related Workers"              
## [13] "2800 TO 2960: Media and Communications Workers"                                     
## [14] "3000 TO 3260: Health Diagnosing and Treating Practitioners"                         
## [15] "3300 TO 3650: Health Care Technical and Support Occupations"                        
## [16] "3700 TO 3950: Protective Service Occupations"                                       
## [17] "4000 TO 4160: Food Preparation and Serving Related Occupations"                     
## [18] "4200 TO 4250: Cleaning and Building Service Occupations"                            
## [19] "4300 TO 4430: Entertainment Attendants and Related Workers"                         
## [20] "4500 TO 4650: Personal Care and Service Workers"                                    
## [21] "4700 TO 4960: Sales and Related Workers"                                            
## [22] "500 TO 950: Management Related Occupations"                                         
## [23] "5000 TO 5930: Office and Administrative Support Workers"                            
## [24] "6000 TO 6130: Farming, Fishing and Forestry Occupations"                            
## [25] "6200 TO 6940: Construction Trade and Extraction Workers"                            
## [26] "7000 TO 7620: Installation, Maintenance and Repairs Workers"                        
## [27] "7700 TO 7750: Production and Operating Workers"                                     
## [28] "7800 TO 7850: Food Preparation Occupations"                                         
## [29] "7900 TO 8960: Setters, Operators and Tenders"                                       
## [30] "9000 TO 9750: Transportation and Material Moving Workers"                           
## [31] "9990: Uncodeable"
## 
##                                                                                     
##                                                                                  56 
##                     10 TO 430: Executive, Administrative and Managerial Occupations 
##                                                                                 377 
##                                  1000 TO 1240: Mathematical and Computer Scientists 
##                                                                                  64 
## 1300 TO 1560: Engineers, Architects, Surveyers, Engineering and Related Technicians 
##                                                                                  53 
##                                                   1600 TO 1760: Physical Scientists 
##                                                                                   4 
##                                 1800 TO 1860: Social Scientists and Related Workers 
##                                                                                   6 
##                         1900 TO 1960: Life, Physical and Social Science Technicians 
##                                                                                   7 
##                             2000 TO 2060: Counselors, Sociala and Religious Workers 
##                                                                                  41 
##                             2100 TO 2150: Lawyers, Judges and Legal Support Workers 
##                                                                                  15 
##                                                              2200 TO 2340: Teachers 
##                                                                                 120 
##                               2400 TO 2550: Education, Training and Library Workers 
##                                                                                  29 
##               2600 TO 2760: Entertainers and Performers, Sports and Related Workers 
##                                                                                  24 
##                                      2800 TO 2960: Media and Communications Workers 
##                                                                                  13 
##                          3000 TO 3260: Health Diagnosing and Treating Practitioners 
##                                                                                  74 
##                         3300 TO 3650: Health Care Technical and Support Occupations 
##                                                                                  99 
##                                        3700 TO 3950: Protective Service Occupations 
##                                                                                  54 
##                      4000 TO 4160: Food Preparation and Serving Related Occupations 
##                                                                                  68 
##                             4200 TO 4250: Cleaning and Building Service Occupations 
##                                                                                  67 
##                          4300 TO 4430: Entertainment Attendants and Related Workers 
##                                                                                  10 
##                                     4500 TO 4650: Personal Care and Service Workers 
##                                                                                  42 
##                                             4700 TO 4960: Sales and Related Workers 
##                                                                                 205 
##                                          500 TO 950: Management Related Occupations 
##                                                                                 108 
##                             5000 TO 5930: Office and Administrative Support Workers 
##                                                                                 360 
##                             6000 TO 6130: Farming, Fishing and Forestry Occupations 
##                                                                                   9 
##                             6200 TO 6940: Construction Trade and Extraction Workers 
##                                                                                 135 
##                         7000 TO 7620: Installation, Maintenance and Repairs Workers 
##                                                                                 108 
##                                      7700 TO 7750: Production and Operating Workers 
##                                                                                  49 
##                                          7800 TO 7850: Food Preparation Occupations 
##                                                                                   4 
##                                        7900 TO 8960: Setters, Operators and Tenders 
##                                                                                 112 
##                            9000 TO 9750: Transportation and Material Moving Workers 
##                                                                                 117 
##                                                                    9990: Uncodeable 
##                                                                                   1
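The `[1] FALSE` line above suggests there are no explicit `NA` values, but the summaries still show values that look like placeholders: a minimum of -2 for Income87, a minimum of -4 for HeightFeet05, and 56 rows with an empty Job05 string. A minimal sketch of how such values could be flagged is given below. The toy data frame and the decision to recode these values to `NA` are assumptions for illustration, not part of the original analysis.

```r
# Toy stand-in for three NLSY79.csv columns (values are made up for illustration)
toy <- data.frame(
  Income87     = c(16000, -2, 9000),
  HeightFeet05 = c(5, -4, 6),
  Job05        = c("2200 TO 2340: Teachers", "", "9990: Uncodeable"),
  stringsAsFactors = FALSE
)

# is.na() only catches explicit NA, so sentinel-like values need separate checks
any_na     <- any(is.na(toy))
neg_income <- sum(toy$Income87 < 0)
neg_height <- sum(toy$HeightFeet05 < 0)
blank_job  <- sum(toy$Job05 == "")

# One possible treatment (an assumption): recode the suspicious values to NA
toy$Income87[toy$Income87 < 0]         <- NA
toy$HeightFeet05[toy$HeightFeet05 < 0] <- NA
toy$Job05[toy$Job05 == ""]             <- NA
n_missing <- colSums(is.na(toy))
```

Whether such rows should be recoded, dropped, or kept depends on how the survey codes non-response, which is not documented here.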

1.2 Self esteem evaluation

Let us concentrate on the Esteem scores evaluated in 1987.

  1. First do a quick summary over all the Esteem variables. Pay attention to missing values, any peculiar numbers etc. How do you fix any problems you discover? Briefly describe what you have done for the data preparation.
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    1.00    1.00    1.38    2.00    4.00
## [1] FALSE
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     1.0     1.0     1.0     1.4     2.0     4.0
## [1] FALSE
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    3.00    4.00    3.58    4.00    4.00
## [1] FALSE
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     1.0     1.0     1.0     1.5     2.0     4.0
## [1] FALSE
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    3.00    4.00    3.53    4.00    4.00
## [1] FALSE
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    1.00    2.00    1.59    2.00    4.00
## [1] FALSE
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    1.00    2.00    1.72    2.00    4.00
## [1] FALSE
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     1.0     3.0     3.0     3.1     4.0     4.0
## [1] FALSE
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    3.00    3.00    3.06    4.00    4.00
## [1] FALSE
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    3.00    3.00    3.37    4.00    4.00
## [1] FALSE

The first thing I did was create a summary of each Esteem87 variable to get a basic sense of the distribution of the scores. After this, I checked for missing values (of which there are none) and examined the interpretation of the scores more carefully. From this, I understood that some questions were framed in such a way that higher scores indicated higher self-esteem, and other questions were framed in such a way that lower scores indicated higher self-esteem. This needs to be standardised across all the questions so that the Esteem87 items can be compared easily.

  1. Please note that, as coded (1: strongly agree, 4: strongly disagree), lower scores on Esteem questions 1, 2, 4, 6, and 7 indicate higher self-esteem, whereas lower scores on the remaining questions indicate lower self-esteem. To maintain consistency, consider reversing the scores of questions 1, 2, 4, 6, and 7 so that a higher score always indicates higher self-esteem. For example, if the esteem data is stored in data.esteem, you can use the code data.esteem[, c(1, 2, 4, 6, 7)] <- 5 - data.esteem[, c(1, 2, 4, 6, 7)] to invert the scores.

To fix this, I identified the positively worded questions (Questions 1, 2, 4, 6 and 7). For these, lower scores (“Strongly Agree”) indicated higher self-esteem. I reversed these scores, replacing each score x with 5 - x, which creates a standardised measure where higher scores on every question indicate higher self-esteem.
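The reversal can be illustrated on a small example. The sketch below uses two made-up respondents rather than the real data, and the object names `esteem`, `reversed` and `positive_items` are hypothetical.

```r
# Two made-up respondents answering the 10 questions on the 1-4 scale
esteem <- data.frame(matrix(c(1, 1, 4, 1, 4, 1, 2, 3, 3, 4,
                              2, 1, 3, 2, 4, 2, 2, 4, 4, 4),
                            nrow = 2, byrow = TRUE))
names(esteem) <- paste0("Esteem87_", 1:10)

# Positively worded questions, where 1 (strongly agree) means high self-esteem
positive_items <- c(1, 2, 4, 6, 7)

# 5 - x maps 1 -> 4, 2 -> 3, 3 -> 2, 4 -> 1 and leaves the other columns alone
reversed <- esteem
reversed[, positive_items] <- 5 - reversed[, positive_items]
```

Because the map x -> 5 - x is its own inverse, running the reversal twice restores the original coding, so it should be applied exactly once.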

  1. Write a brief summary with necessary plots about the 10 esteem measurements.

After reversing, all ten items are left-skewed, with the vast majority of scores being 3 or 4. Esteem87_1 through Esteem87_5 are the most skewed, with means of about 3.62, 3.60, 3.58, 3.50 and 3.53 respectively. Esteem87_6 through Esteem87_10 are also left-skewed, but to a lesser extent, with means of about 3.41, 3.28, 3.10, 3.06 and 3.37. (For the reversed items 1, 2, 4, 6 and 7, each mean is 5 minus the raw mean reported in the summary above.)

  1. Are esteem scores all positively correlated? Report the pairwise correlation table and write a brief summary.
##             Esteem87_1 Esteem87_2 Esteem87_3 Esteem87_4 Esteem87_5 Esteem87_6
## Esteem87_1       1.000      0.704      0.448      0.528      0.399      0.464
## Esteem87_2       0.704      1.000      0.443      0.551      0.402      0.481
## Esteem87_3       0.448      0.443      1.000      0.408      0.549      0.410
## Esteem87_4       0.528      0.551      0.408      1.000      0.381      0.509
## Esteem87_5       0.399      0.402      0.549      0.381      1.000      0.405
## Esteem87_6       0.464      0.481      0.410      0.509      0.405      1.000
## Esteem87_7       0.379      0.410      0.343      0.422      0.370      0.600
## Esteem87_8       0.273      0.283      0.351      0.295      0.381      0.409
## Esteem87_9       0.236      0.259      0.349      0.287      0.354      0.364
## Esteem87_10      0.312      0.330      0.460      0.366      0.436      0.442
##             Esteem87_7 Esteem87_8 Esteem87_9 Esteem87_10
## Esteem87_1       0.379      0.273      0.236       0.312
## Esteem87_2       0.410      0.283      0.259       0.330
## Esteem87_3       0.343      0.351      0.349       0.460
## Esteem87_4       0.422      0.295      0.287       0.366
## Esteem87_5       0.370      0.381      0.354       0.436
## Esteem87_6       0.600      0.409      0.364       0.442
## Esteem87_7       1.000      0.389      0.352       0.390
## Esteem87_8       0.389      1.000      0.430       0.438
## Esteem87_9       0.352      0.430      1.000       0.579
## Esteem87_10      0.390      0.438      0.579       1.000

All of the scores are positively correlated. The weakest correlation is between Esteem87_1 and Esteem87_9 (0.236), followed by Esteem87_2 and Esteem87_9 (0.259) and Esteem87_1 and Esteem87_8 (0.273). The strongest correlation is between Esteem87_1 and Esteem87_2 (0.704).
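Reading the extremes off a 10 x 10 table by eye is error-prone, so a short sketch of how the weakest and strongest pairs could be extracted programmatically is given below. It uses only a 3 x 3 excerpt of the correlation table above (items 1, 2 and 9); the object names are hypothetical.

```r
# Excerpt of the reported correlation table (items 1, 2 and 9 only)
items <- c("Esteem87_1", "Esteem87_2", "Esteem87_9")
R <- matrix(c(1.000, 0.704, 0.236,
              0.704, 1.000, 0.259,
              0.236, 0.259, 1.000),
            nrow = 3, byrow = TRUE, dimnames = list(items, items))

# Keep each pair once by looking only at the upper triangle
upper <- which(upper.tri(R), arr.ind = TRUE)
pairs <- data.frame(item_a = rownames(R)[upper[, "row"]],
                    item_b = colnames(R)[upper[, "col"]],
                    r      = R[upper])

weakest      <- pairs[which.min(pairs$r), ]
strongest    <- pairs[which.max(pairs$r), ]
all_positive <- all(pairs$r > 0)
```

The same steps apply unchanged to the full matrix returned by `cor()` on the ten reversed items.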

  1. PCA on 10 esteem measurements. (centered but no scaling)

a) Report the PC1 and PC2 loadings. Are they unit vectors? Are they orthogonal?
##               PC1     PC2
## Esteem87_1  0.324 -0.4452
## Esteem87_2  0.333 -0.4283
## Esteem87_3  0.322  0.0115
## Esteem87_4  0.324 -0.2877
## Esteem87_5  0.315  0.0793
## Esteem87_6  0.347 -0.0492
## Esteem87_7  0.315  0.0196
## Esteem87_8  0.280  0.3619
## Esteem87_9  0.277  0.4917
## Esteem87_10 0.318  0.3918
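Yes. The PC1 and PC2 loading vectors are both unit vectors (the squared loadings of each sum to 1, up to rounding), and they are orthogonal to each other (their dot product is 0, up to rounding).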
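Both properties can be checked directly from the rounded loadings printed above. The sketch below copies those ten pairs of numbers; because they are rounded to three or four decimals, the results are only approximately 1 and 0.

```r
# Loadings copied from the table above (rounded, so checks are approximate)
pc1 <- c(0.324, 0.333, 0.322, 0.324, 0.315, 0.347, 0.315, 0.280, 0.277, 0.318)
pc2 <- c(-0.4452, -0.4283, 0.0115, -0.2877, 0.0793,
         -0.0492, 0.0196, 0.3619, 0.4917, 0.3918)

# Unit length: the Euclidean norm of each loading vector should be 1
len1 <- sqrt(sum(pc1^2))
len2 <- sqrt(sum(pc2^2))

# Orthogonality: the dot product of the two loading vectors should be 0
dot12 <- sum(pc1 * pc2)
```

With unrounded loadings taken from the `rotation` component of a `prcomp` fit, the same checks hold to machine precision.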

b) Are there good interpretations for PC1 and PC2? (If loadings are all negative, take the positive loadings for the ease of interpretation)

Loadings are direction vectors that define each PC. Large absolute loadings indicate a strong contribution, and the signs indicate the direction in which the variables move relative to each other. In this case, all the PC1 loadings are positive and of similar size, indicating that all the variables move in the same direction together. Esteem87_6 (0.347) and Esteem87_2 (0.333) have the largest PC1 loadings. Looking at PC2, the loadings of Esteem87_1, Esteem87_2, Esteem87_4 and Esteem87_6 are negative whilst the remaining loadings are positive, indicating that these two sets of variables move in contrasting directions. Esteem87_9 (0.4917), Esteem87_1 (-0.4452) and Esteem87_2 (-0.4283) have the largest absolute PC2 loadings.

c) How is the PC1 score obtained for each subject? Write down the formula.

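The PC1 score for subject $i$ is a linear combination of the centred (but not scaled) Esteem87 variables, each weighted by its PC1 loading: $\text{PC1}_i = 0.324\,(x_{i1} - \bar{x}_1) + 0.333\,(x_{i2} - \bar{x}_2) + 0.322\,(x_{i3} - \bar{x}_3) + \dots + 0.318\,(x_{i,10} - \bar{x}_{10})$, where $x_{ij}$ is subject $i$'s (reversed) score on question $j$ and $\bar{x}_j$ is the sample mean of question $j$.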
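The formula can be verified numerically: multiplying the centred data matrix by the PC1 loading vector should reproduce the first column of scores returned by `prcomp`. The sketch below uses simulated 1 to 4 responses for 20 hypothetical subjects, not the NLSY79 data.

```r
set.seed(1)

# Simulated responses: 20 hypothetical subjects, 10 questions on a 1-4 scale
X <- matrix(sample(1:4, 200, replace = TRUE), ncol = 10)
colnames(X) <- paste0("Esteem87_", 1:10)

# PCA with centring but no scaling, matching the setting of the question
pca <- prcomp(X, center = TRUE, scale. = FALSE)

# Manual PC1 score: centre each column, then weight by the PC1 loadings
X_centered <- sweep(X, 2, colMeans(X))
pc1_manual <- as.vector(X_centered %*% pca$rotation[, 1])

# Largest discrepancy between the manual scores and prcomp's scores
max_diff <- max(abs(pc1_manual - pca$x[, 1]))
```

If `scale. = TRUE` had been used instead, each centred column would also need to be divided by its standard deviation before applying the loadings.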

d) Are PC1 scores and PC2 scores in the data uncorrelated? 

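Yes, the PC1 and PC2 scores are uncorrelated. This holds by construction: the loadings are orthogonal eigenvectors of the sample covariance matrix, so the sample covariance between the scores of any two different PCs is zero.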
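A quick numerical illustration on simulated data (ten correlated variables sharing a common factor, which is an assumption chosen only to mimic correlated items): the covariance matrix of the PC scores comes out diagonal, with the PC variances on the diagonal.

```r
set.seed(2)

# Ten correlated variables built from a shared factor plus noise
base <- rnorm(50)
X <- sapply(1:10, function(j) base + rnorm(50))

pca <- prcomp(X, center = TRUE, scale. = FALSE)

# Correlation between PC1 and PC2 scores
score_cor <- cor(pca$x[, 1], pca$x[, 2])

# Full covariance matrix of the scores: off-diagonal entries should vanish
score_cov <- cov(pca$x)
off_diag  <- max(abs(score_cov[upper.tri(score_cov)]))
```

The diagonal of `score_cov` equals `pca$sdev^2`, the variances that feed into the PVE calculation.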

e) Plot PVE (Proportion of Variance Explained) and summarize the plot. 

<img src="hw2_sp2026_files/figure-html/unnamed-chunk-7-1.png" width="768" />

f) Also plot CPVE (Cumulative Proportion of Variance Explained). What proportion of the variance in the data is explained by the first two principal components?

<img src="hw2_sp2026_files/figure-html/unnamed-chunk-8-1.png" width="768" />

From this, we can see that 60% of the variance is explained by the first two principal components.
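For reference, PVE and CPVE can be computed from the `sdev` component of a `prcomp` fit. The sketch below runs on simulated data, so its numbers will not match the figures above; the plotting calls are guarded so the block also runs non-interactively.

```r
set.seed(3)

# Simulated stand-in for the ten esteem items
base <- rnorm(100)
X <- sapply(1:10, function(j) base + rnorm(100))

pca <- prcomp(X, center = TRUE, scale. = FALSE)

# PVE: each PC's variance as a share of the total variance
pve  <- pca$sdev^2 / sum(pca$sdev^2)
# CPVE: running total of the PVE
cpve <- cumsum(pve)

# Share of variance explained by the first two PCs
first_two <- cpve[2]

if (interactive()) {
  plot(pve, type = "b", xlab = "Principal component", ylab = "PVE")
  plot(cpve, type = "b", xlab = "Principal component", ylab = "CPVE", ylim = c(0, 1))
}
```

The value `cpve[2]` is the quantity asked for in part f).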

g) PC’s provide us with a low dimensional view of the self-esteem scores. Use a biplot with the first two PC's to display the data. Give an interpretation of PC1 and PC2 from the plot. 

From this, we can see that all loadings for PC1 (x-axis) point in the positive direction, so every item contributes to PC1 in the same direction. Since no loading is extreme, PC1 is essentially a weighted average of all ten items and, as such, can be interpreted as a general level of self-esteem.

Conversely, looking at PC2 (y-axis), we see that the loadings point in both the positive and negative directions, splitting the variables into two groups. This likely reflects the different effects of positively worded and negatively worded questions. The negatively worded questions (3, 5, 8, 9 and 10) all load positively on PC2, while the positively worded questions load negatively, with the exception of Esteem87_7 (“I am satisfied with myself”), which has a small positive loading. PC2 can therefore be read as a contrast between responses to the negatively worded and the positively worded questions.

  1. Apply k-means to cluster subjects on the original esteem scores

a) Find a reasonable number of clusters using the within-cluster sum of squares and the elbow rule.

Looking at the total within-cluster sum of squares, we can identify an elbow at 3 clusters.
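The elbow rule can be sketched as follows: fit k-means for a range of k, record the total within-cluster sum of squares, and look for the point after which further clusters bring only small reductions. The data below are simulated with three well-separated groups (an assumption made so that the elbow is easy to see), and `nstart = 10` is an arbitrary choice to reduce sensitivity to the random start.

```r
set.seed(4)

# Simulated 2-D data with three groups centred at -3, 0 and 3
centers <- c(-3, 0, 3)
X <- do.call(rbind, lapply(centers, function(m) {
  matrix(rnorm(60 * 2, mean = m), ncol = 2)
}))

# Total within-cluster sum of squares for k = 1, ..., 8
wss <- sapply(1:8, function(k) kmeans(X, centers = k, nstart = 10)$tot.withinss)

# Reduction in WSS gained by each additional cluster
drops <- -diff(wss)

if (interactive()) {
  plot(1:8, wss, type = "b",
       xlab = "Number of clusters k", ylab = "Total within-cluster SS")
}
```

Here `drops[2]` (going from 2 to 3 clusters) is much larger than `drops[3]` (going from 3 to 4), which is the pattern the elbow rule looks for.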

b) Can you summarize common features within each cluster?
## [1] 843 697 891
##       PC1    PC2
## 1 -2.3627  0.600
## 2 -0.0473 -1.130
## 3  2.2725  0.316

Cluster 1 contains 843 observations and is centred at (-2.3627, 0.6000); Cluster 2 contains 697 observations and is centred at (-0.0473, -1.1300); and Cluster 3 contains 891 observations and is centred at (2.2725, 0.3160), where each centre is given as (PC1, PC2).

Going off my interpretation of PC1 as overall self-esteem, we can classify Cluster 1 as low self-esteem, Cluster 2 as average self-esteem and Cluster 3 as high self-esteem. When examining PC2 above, we suggested that it may capture the differing tone in which the questions were framed (positive and negative). Clusters 1 and 3 contain mostly positive PC2 values with some negative values, whereas Cluster 2 contains mostly negative PC2 values. The positive/negative wording interpretation does not map cleanly onto these clusters, because there are three clusters rather than two, and each cluster contains a range of positive and negative PC2 values. Cluster 2 is therefore separated from the others by some factor related to PC2, but we were unable to find a clear interpretation of it.
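A table of cluster sizes and cluster centres in PC space, like the one above, could be produced along the following lines. This is a sketch on simulated responses, and it assumes that k-means is run on the item scores and the clusters are then summarised by their mean PC1 and PC2 scores; the original code is not shown, so this workflow is an assumption.

```r
set.seed(5)

# Simulated 1-4 responses for 90 hypothetical subjects in three latent groups
level <- rep(c(1.5, 2.5, 3.5), each = 30)
X <- sapply(1:10, function(j) {
  pmin(4, pmax(1, round(level + rnorm(90, sd = 0.6))))
})
colnames(X) <- paste0("Esteem87_", 1:10)

# Cluster on the item scores, then describe the clusters in PC space
km  <- kmeans(X, centers = 3, nstart = 20)
pca <- prcomp(X, center = TRUE, scale. = FALSE)

sizes      <- table(km$cluster)
centers_pc <- aggregate(pca$x[, 1:2], by = list(cluster = km$cluster), FUN = mean)
```

Because PC scores are centred, the size-weighted average of the cluster centres on each PC is zero, which is why the three PC1 centres reported above roughly balance around 0.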

c) Can you visualize the clusters with somewhat clear boundaries? You may try different pairs of variables and different PC pairs of the esteem scores.

Note that, in this case, we have chosen to cluster and visualise only on pairs of PCs, because the original items are strongly correlated with one another and contain unwanted noise.
## [1] 799 776 856
##        PC1    PC3
## 1 -0.00167 -0.704
## 2 -2.56519  0.339
## 3  2.32701  0.350

## [1]  605 1060  766
##      PC2      PC3
## 1  0.638 -0.98529
## 2  0.560  0.55887
## 3 -1.279  0.00482

  1. We now try to find out what factors are related to self-esteem. PC1 of all the Esteem scores is a good variable to summarize one’s esteem scores. We take PC1 as our response variable.

    1. Prepare possible factors/variables:

    First, we conducted PCA on the ASVAB test scores, extracted the PC1 scores, and added them to the dataset as a measure of general intelligence (Intelligence).

    Next, we will create a BMI variable to summarise an individual’s body height and weight.
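    The BMI values shown in the output below are consistent with the standard imperial formula, BMI = 703 × weight (lb) / height (in)². A minimal sketch is given here, checked against the first three subjects in the data listing; the function name `bmi_from_imperial` is hypothetical.

```r
# BMI from weight in pounds and height in feet plus inches
bmi_from_imperial <- function(weight_lb, height_ft, height_in) {
  total_in <- 12 * height_ft + height_in   # total height in inches
  703 * weight_lb / total_in^2             # 703 converts lb/in^2 to kg/m^2
}

# First three subjects from the data listing:
# Weight05 = 160, 187, 175; HeightFeet05 = 5, 5, 5; HeightInch05 = 2, 5, 9
bmi <- bmi_from_imperial(c(160, 187, 175), c(5, 5, 5), c(2, 5, 9))
bmi_rounded <- round(bmi, 1)
```

    Note that HeightFeet05 has a minimum of -4 in the summary above, so any such rows would produce a meaningless BMI unless they are handled first.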

    Finally, we remove the unwanted variables from the dataset: the individual Esteem81 and Esteem87 items, the ASVAB test scores (including AFQT), and the height and weight variables. These variables are either already summarised by Esteem_PC1, Intelligence or BMI, or are not needed (like Esteem81).

## 'data.frame':    2431 obs. of  15 variables:
##  $ Subject       : int  2 6 7 8 9 13 16 17 18 20 ...
##  $ Gender        : chr  "female" "male" "male" "female" ...
##  $ Education05   : int  12 16 12 14 14 16 13 13 13 17 ...
##  $ Income87      : int  16000 18000 0 9000 15000 2200 27000 20000 28000 27000 ...
##  $ Job05         : chr  "4700 TO 4960: Sales and Related Workers" "10 TO 430: Executive, Administrative and Managerial Occupations" "7900 TO 8960: Setters, Operators and Tenders" "5000 TO 5930: Office and Administrative Support Workers" ...
##  $ Income05      : int  5500 65000 19000 36000 65000 8000 71000 43000 120000 64000 ...
##  $ Imagazine     : int  1 0 1 1 1 1 1 1 1 1 ...
##  $ Inewspaper    : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ Ilibrary      : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ MotherEd      : int  5 12 12 9 12 12 12 12 12 12 ...
##  $ FatherEd      : int  8 12 12 6 10 16 12 15 16 18 ...
##  $ FamilyIncome78: int  20000 35000 8502 7227 17000 20000 48000 15000 4510 50000 ...
##  $ Esteem_PC1    : num  -0.54 1.4 -0.38 -0.62 3.07 ...
##  $ Intelligence  : num  -4.366 4.545 -1.603 -0.872 0.312 ...
##  $ BMI           : num  29.3 31.1 25.8 43.6 29.1 ...
Following the data preparation, we will conduct some EDA to gain a sense of the structure of the final dataset as well as the distribution of each variable.
##     Subject         Gender           Education05      Income87    
##  Min.   :    2   Length:2431        Min.   : 6.0   Min.   :   -2  
##  1st Qu.: 1592   Class :character   1st Qu.:12.0   1st Qu.: 4500  
##  Median : 3137   Mode  :character   Median :13.0   Median :12000  
##  Mean   : 3504                      Mean   :13.9   Mean   :13399  
##  3rd Qu.: 4668                      3rd Qu.:16.0   3rd Qu.:19000  
##  Max.   :12140                      Max.   :20.0   Max.   :59387  
##     Job05              Income05        Imagazine       Inewspaper   
##  Length:2431        Min.   :    63   Min.   :0.000   Min.   :0.000  
##  Class :character   1st Qu.: 22650   1st Qu.:0.000   1st Qu.:1.000  
##  Mode  :character   Median : 38500   Median :1.000   Median :1.000  
##                     Mean   : 49415   Mean   :0.718   Mean   :0.861  
##                     3rd Qu.: 61350   3rd Qu.:1.000   3rd Qu.:1.000  
##                     Max.   :703637   Max.   :1.000   Max.   :1.000  
##     Ilibrary       MotherEd       FatherEd    FamilyIncome78    Esteem_PC1     
##  Min.   :0.00   Min.   : 0.0   Min.   : 0.0   Min.   :    0   Min.   :-9.0499  
##  1st Qu.:1.00   1st Qu.:11.0   1st Qu.:10.0   1st Qu.:11167   1st Qu.:-1.8806  
##  Median :1.00   Median :12.0   Median :12.0   Median :20000   Median : 0.0669  
##  Mean   :0.77   Mean   :11.7   Mean   :11.9   Mean   :21252   Mean   : 0.0000  
##  3rd Qu.:1.00   3rd Qu.:12.0   3rd Qu.:14.0   3rd Qu.:27500   3rd Qu.: 1.9200  
##  Max.   :1.00   Max.   :20.0   Max.   :20.0   Max.   :75001   Max.   : 3.0734  
##   Intelligence         BMI       
##  Min.   :-9.667   Min.   : 11.9  
##  1st Qu.:-1.849   1st Qu.: 24.1  
##  Median : 0.342   Median : 27.3  
##  Mean   : 0.000   Mean   : 28.1  
##  3rd Qu.: 2.120   3rd Qu.: 30.9  
##  Max.   : 5.026   Max.   :169.5
b)   Run a few regression models between PC1 of all the esteem scores in 87 and suitable variables listed in a). Find a final best model with your **own clearly defined criterion**. 

We will fit both a forward and a backward stepwise multiple linear regression and choose the model that minimises MSE and maximises R-squared. Esteem_PC1 is the dependent variable, and the remaining factors are the candidate independent variables.

We first run a backward stepwise regression, which starts with all the variables and removes the least useful one at each step, until removing further variables no longer improves the model.
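
A minimal sketch of backward elimination is shown here on simulated data. It assumes base R's `step()` function, which by default compares models by AIC rather than by p-values; the report does not show which function or criterion was used, so this is one plausible setup and not necessarily the one behind the output.

```r
# Simulated data: y depends on x1 and x2 only; x3 and noise are irrelevant
set.seed(3)
n <- 500
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n), noise = rnorm(n))
sim$y <- 1 + 0.8 * sim$x1 + 0.5 * sim$x2 + rnorm(n)

# Start from the full model and let step() drop terms one at a time
full <- lm(y ~ ., data = sim)
backward <- step(full, direction = "backward", trace = 0)

summary(backward)   # coefficient table of the selected model
formula(backward)   # formula of the selected model
```

Note that `step()` adds or drops a factor such as `Job05` as a whole term, so all of its levels stay in or leave the model together.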
## 
## Call:
## lm(formula = Esteem_PC1 ~ Education05 + Income87 + Job05 + Income05 + 
##     Inewspaper + Ilibrary + Intelligence, data = temp)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -8.393 -1.558  0.002  1.672  5.136 
## 
## Coefficients:
##                                                                                           Estimate
## (Intercept)                                                                              -2.20e+00
## Education05                                                                               7.74e-02
## Income87                                                                                  1.30e-05
## Job0510 TO 430: Executive, Administrative and Managerial Occupations                      5.42e-01
## Job051000 TO 1240: Mathematical and Computer Scientists                                   6.82e-01
## Job051300 TO 1560: Engineers, Architects, Surveyers, Engineering and Related Technicians -1.27e-01
## Job051600 TO 1760: Physical Scientists                                                   -1.52e+00
## Job051800 TO 1860: Social Scientists and Related Workers                                 -6.00e-01
## Job051900 TO 1960: Life, Physical and Social Science Technicians                          2.37e-01
## Job052000 TO 2060: Counselors, Sociala and Religious Workers                              3.12e-01
## Job052100 TO 2150: Lawyers, Judges and Legal Support Workers                              1.24e-01
## Job052200 TO 2340: Teachers                                                               4.28e-01
## Job052400 TO 2550: Education, Training and Library Workers                                4.43e-01
## Job052600 TO 2760: Entertainers and Performers, Sports and Related Workers                1.24e+00
## Job052800 TO 2960: Media and Communications Workers                                       5.16e-01
## Job053000 TO 3260: Health Diagnosing and Treating Practitioners                           6.32e-01
## Job053300 TO 3650: Health Care Technical and Support Occupations                         -1.80e-01
## Job053700 TO 3950: Protective Service Occupations                                         1.03e+00
## Job054000 TO 4160: Food Preparation and Serving Related Occupations                      -1.68e-01
## Job054200 TO 4250: Cleaning and Building Service Occupations                             -1.67e-01
## Job054300 TO 4430: Entertainment Attendants and Related Workers                          -1.14e+00
## Job054500 TO 4650: Personal Care and Service Workers                                      5.29e-01
## Job054700 TO 4960: Sales and Related Workers                                              4.31e-01
## Job05500 TO 950: Management Related Occupations                                           8.43e-01
## Job055000 TO 5930: Office and Administrative Support Workers                              4.92e-01
## Job056000 TO 6130: Farming, Fishing and Forestry Occupations                              1.36e-01
## Job056200 TO 6940: Construction Trade and Extraction Workers                              6.85e-02
## Job057000 TO 7620: Installation, Maintenance and Repairs Workers                          2.50e-01
## Job057700 TO 7750: Production and Operating Workers                                       2.51e-01
## Job057800 TO 7850: Food Preparation Occupations                                           4.59e-01
## Job057900 TO 8960: Setters, Operators and Tenders                                         2.94e-01
## Job059000 TO 9750: Transportation and Material Moving Workers                            -1.42e-01
## Job059990: Uncodeable                                                                     2.02e-02
## Income05                                                                                  4.61e-06
## Inewspaper                                                                                2.99e-01
## Ilibrary                                                                                  1.44e-01
## Intelligence                                                                              1.24e-01
##                                                                                          Std. Error
## (Intercept)                                                                                4.27e-01
## Education05                                                                                2.29e-02
## Income87                                                                                   3.87e-06
## Job0510 TO 430: Executive, Administrative and Managerial Occupations                       2.91e-01
## Job051000 TO 1240: Mathematical and Computer Scientists                                    3.72e-01
## Job051300 TO 1560: Engineers, Architects, Surveyers, Engineering and Related Technicians   3.88e-01
## Job051600 TO 1760: Physical Scientists                                                     1.04e+00
## Job051800 TO 1860: Social Scientists and Related Workers                                   8.67e-01
## Job051900 TO 1960: Life, Physical and Social Science Technicians                           8.08e-01
## Job052000 TO 2060: Counselors, Sociala and Religious Workers                               4.18e-01
## Job052100 TO 2150: Lawyers, Judges and Legal Support Workers                               5.93e-01
## Job052200 TO 2340: Teachers                                                                3.35e-01
## Job052400 TO 2550: Education, Training and Library Workers                                 4.62e-01
## Job052600 TO 2760: Entertainers and Performers, Sports and Related Workers                 4.93e-01
## Job052800 TO 2960: Media and Communications Workers                                        6.21e-01
## Job053000 TO 3260: Health Diagnosing and Treating Practitioners                            3.61e-01
## Job053300 TO 3650: Health Care Technical and Support Occupations                           3.37e-01
## Job053700 TO 3950: Protective Service Occupations                                          3.84e-01
## Job054000 TO 4160: Food Preparation and Serving Related Occupations                        3.65e-01
## Job054200 TO 4250: Cleaning and Building Service Occupations                               3.66e-01
## Job054300 TO 4430: Entertainment Attendants and Related Workers                            6.91e-01
## Job054500 TO 4650: Personal Care and Service Workers                                       4.12e-01
## Job054700 TO 4960: Sales and Related Workers                                               3.04e-01
## Job05500 TO 950: Management Related Occupations                                            3.33e-01
## Job055000 TO 5930: Office and Administrative Support Workers                               2.90e-01
## Job056000 TO 6130: Farming, Fishing and Forestry Occupations                               7.26e-01
## Job056200 TO 6940: Construction Trade and Extraction Workers                               3.21e-01
## Job057000 TO 7620: Installation, Maintenance and Repairs Workers                           3.33e-01
## Job057700 TO 7750: Production and Operating Workers                                        3.94e-01
## Job057800 TO 7850: Food Preparation Occupations                                            1.04e+00
## Job057900 TO 8960: Setters, Operators and Tenders                                          3.31e-01
## Job059000 TO 9750: Transportation and Material Moving Workers                              3.28e-01
## Job059990: Uncodeable                                                                      2.03e+00
## Income05                                                                                   1.04e-06
## Inewspaper                                                                                 1.27e-01
## Ilibrary                                                                                   1.02e-01
## Intelligence                                                                               2.04e-02
##                                                                                          t value
## (Intercept)                                                                                -5.14
## Education05                                                                                 3.38
## Income87                                                                                    3.36
## Job0510 TO 430: Executive, Administrative and Managerial Occupations                        1.86
## Job051000 TO 1240: Mathematical and Computer Scientists                                     1.83
## Job051300 TO 1560: Engineers, Architects, Surveyers, Engineering and Related Technicians   -0.33
## Job051600 TO 1760: Physical Scientists                                                     -1.45
## Job051800 TO 1860: Social Scientists and Related Workers                                   -0.69
## Job051900 TO 1960: Life, Physical and Social Science Technicians                            0.29
## Job052000 TO 2060: Counselors, Sociala and Religious Workers                                0.75
## Job052100 TO 2150: Lawyers, Judges and Legal Support Workers                                0.21
## Job052200 TO 2340: Teachers                                                                 1.28
## Job052400 TO 2550: Education, Training and Library Workers                                  0.96
## Job052600 TO 2760: Entertainers and Performers, Sports and Related Workers                  2.52
## Job052800 TO 2960: Media and Communications Workers                                         0.83
## Job053000 TO 3260: Health Diagnosing and Treating Practitioners                             1.75
## Job053300 TO 3650: Health Care Technical and Support Occupations                           -0.53
## Job053700 TO 3950: Protective Service Occupations                                           2.68
## Job054000 TO 4160: Food Preparation and Serving Related Occupations                        -0.46
## Job054200 TO 4250: Cleaning and Building Service Occupations                               -0.46
## Job054300 TO 4430: Entertainment Attendants and Related Workers                            -1.65
## Job054500 TO 4650: Personal Care and Service Workers                                        1.29
## Job054700 TO 4960: Sales and Related Workers                                                1.42
## Job05500 TO 950: Management Related Occupations                                             2.53
## Job055000 TO 5930: Office and Administrative Support Workers                                1.70
## Job056000 TO 6130: Farming, Fishing and Forestry Occupations                                0.19
## Job056200 TO 6940: Construction Trade and Extraction Workers                                0.21
## Job057000 TO 7620: Installation, Maintenance and Repairs Workers                            0.75
## Job057700 TO 7750: Production and Operating Workers                                         0.64
## Job057800 TO 7850: Food Preparation Occupations                                             0.44
## Job057900 TO 8960: Setters, Operators and Tenders                                           0.89
## Job059000 TO 9750: Transportation and Material Moving Workers                              -0.43
## Job059990: Uncodeable                                                                       0.01
## Income05                                                                                    4.43
## Inewspaper                                                                                  2.34
## Ilibrary                                                                                    1.41
## Intelligence                                                                                6.08
##                                                                                          Pr(>|t|)
## (Intercept)                                                                               3.0e-07
## Education05                                                                               0.00075
## Income87                                                                                  0.00079
## Job0510 TO 430: Executive, Administrative and Managerial Occupations                      0.06237
## Job051000 TO 1240: Mathematical and Computer Scientists                                   0.06678
## Job051300 TO 1560: Engineers, Architects, Surveyers, Engineering and Related Technicians  0.74262
## Job051600 TO 1760: Physical Scientists                                                    0.14623
## Job051800 TO 1860: Social Scientists and Related Workers                                  0.48877
## Job051900 TO 1960: Life, Physical and Social Science Technicians                          0.76896
## Job052000 TO 2060: Counselors, Sociala and Religious Workers                              0.45472
## Job052100 TO 2150: Lawyers, Judges and Legal Support Workers                              0.83490
## Job052200 TO 2340: Teachers                                                               0.20125
## Job052400 TO 2550: Education, Training and Library Workers                                0.33756
## Job052600 TO 2760: Entertainers and Performers, Sports and Related Workers                0.01188
## Job052800 TO 2960: Media and Communications Workers                                       0.40623
## Job053000 TO 3260: Health Diagnosing and Treating Practitioners                           0.08001
## Job053300 TO 3650: Health Care Technical and Support Occupations                          0.59297
## Job053700 TO 3950: Protective Service Occupations                                         0.00751
## Job054000 TO 4160: Food Preparation and Serving Related Occupations                       0.64525
## Job054200 TO 4250: Cleaning and Building Service Occupations                              0.64848
## Job054300 TO 4430: Entertainment Attendants and Related Workers                           0.09992
## Job054500 TO 4650: Personal Care and Service Workers                                      0.19870
## Job054700 TO 4960: Sales and Related Workers                                              0.15605
## Job05500 TO 950: Management Related Occupations                                           0.01151
## Job055000 TO 5930: Office and Administrative Support Workers                              0.08928
## Job056000 TO 6130: Farming, Fishing and Forestry Occupations                              0.85173
## Job056200 TO 6940: Construction Trade and Extraction Workers                              0.83113
## Job057000 TO 7620: Installation, Maintenance and Repairs Workers                          0.45227
## Job057700 TO 7750: Production and Operating Workers                                       0.52489
## Job057800 TO 7850: Food Preparation Occupations                                           0.66006
## Job057900 TO 8960: Setters, Operators and Tenders                                         0.37412
## Job059000 TO 9750: Transportation and Material Moving Workers                             0.66501
## Job059990: Uncodeable                                                                     0.99207
## Income05                                                                                  9.7e-06
## Inewspaper                                                                                0.01918
## Ilibrary                                                                                  0.15800
## Intelligence                                                                              1.4e-09
##                                                                                             
## (Intercept)                                                                              ***
## Education05                                                                              ***
## Income87                                                                                 ***
## Job0510 TO 430: Executive, Administrative and Managerial Occupations                     .  
## Job051000 TO 1240: Mathematical and Computer Scientists                                  .  
## Job051300 TO 1560: Engineers, Architects, Surveyers, Engineering and Related Technicians    
## Job051600 TO 1760: Physical Scientists                                                      
## Job051800 TO 1860: Social Scientists and Related Workers                                    
## Job051900 TO 1960: Life, Physical and Social Science Technicians                            
## Job052000 TO 2060: Counselors, Sociala and Religious Workers                                
## Job052100 TO 2150: Lawyers, Judges and Legal Support Workers                                
## Job052200 TO 2340: Teachers                                                                 
## Job052400 TO 2550: Education, Training and Library Workers                                  
## Job052600 TO 2760: Entertainers and Performers, Sports and Related Workers               *  
## Job052800 TO 2960: Media and Communications Workers                                         
## Job053000 TO 3260: Health Diagnosing and Treating Practitioners                          .  
## Job053300 TO 3650: Health Care Technical and Support Occupations                            
## Job053700 TO 3950: Protective Service Occupations                                        ** 
## Job054000 TO 4160: Food Preparation and Serving Related Occupations                         
## Job054200 TO 4250: Cleaning and Building Service Occupations                                
## Job054300 TO 4430: Entertainment Attendants and Related Workers                          .  
## Job054500 TO 4650: Personal Care and Service Workers                                        
## Job054700 TO 4960: Sales and Related Workers                                                
## Job05500 TO 950: Management Related Occupations                                          *  
## Job055000 TO 5930: Office and Administrative Support Workers                             .  
## Job056000 TO 6130: Farming, Fishing and Forestry Occupations                                
## Job056200 TO 6940: Construction Trade and Extraction Workers                                
## Job057000 TO 7620: Installation, Maintenance and Repairs Workers                            
## Job057700 TO 7750: Production and Operating Workers                                         
## Job057800 TO 7850: Food Preparation Occupations                                             
## Job057900 TO 8960: Setters, Operators and Tenders                                           
## Job059000 TO 9750: Transportation and Material Moving Workers                               
## Job059990: Uncodeable                                                                       
## Income05                                                                                 ***
## Inewspaper                                                                               *  
## Ilibrary                                                                                    
## Intelligence                                                                             ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.01 on 2394 degrees of freedom
## Multiple R-squared:  0.15,   Adjusted R-squared:  0.137 
## F-statistic: 11.7 on 36 and 2394 DF,  p-value: <2e-16
## Esteem_PC1 ~ Education05 + Income87 + Job05 + Income05 + Inewspaper + 
##     Ilibrary + Intelligence
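This process selected the model Esteem_PC1 ~ Education05 + Income87 + Job05 + Income05 + Inewspaper + Ilibrary + Intelligence. Keeping only the terms that are significant at the 5% level, the fitted equation is approximately: Esteem_PC1 = -2.20 + 0.0774(Education05) + 0.000013(Income87) + 1.24(Entertainers and Performers, Sports and Related Workers) + 1.03(Protective Service Occupations) + 0.843(Management Related Occupations) + 0.00000461(Income05) + 0.299(Inewspaper) + 0.124(Intelligence). The job terms shown are the occupations whose coefficients differ significantly from the baseline job. The remaining Job05 levels and Ilibrary are still part of the fitted model, but their coefficients are not individually significant.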

This model has an R-squared of 0.15 (adjusted R-squared 0.137), meaning that about 15% of the variation in Esteem_PC1 is explained by the independent variables. The F-statistic of 11.7 on 36 and 2394 degrees of freedom has a p-value below 2x10^-16, indicating that the model as a whole explains a significant share of the variation in the dependent variable. Finally, the residual standard error is 2.01, meaning that observed Esteem_PC1 values typically deviate from the fitted values by about 2.01 units of Esteem_PC1.
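
As a check on the summary output above, the adjusted R-squared can be reproduced by hand from the reported R-squared and degrees of freedom. Here n = 2431 observations and p = 36 estimated slope coefficients are read off the output, and RSS denotes the residual sum of squares:

$$
\text{RSE} = \sqrt{\frac{\text{RSS}}{n - p - 1}}, \qquad
R^2_{\text{adj}} = 1 - (1 - R^2)\,\frac{n - 1}{n - p - 1}
= 1 - (1 - 0.15)\,\frac{2430}{2394} \approx 0.137
$$

The small gap between 0.15 and 0.137 reflects the penalty for the 36 estimated coefficients, most of which belong to the Job05 factor.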

Next, we will run a forward stepwise regression and compare the resulting model with the backward one.
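
A sketch of the forward direction and of the comparison is given here, again on simulated data and again assuming base R's AIC-based `step()`. The helper `fit_stats()` is a hypothetical name introduced for illustration; it collects the in-sample MSE and R-squared used as the selection criterion in this section, plus AIC.

```r
# Simulated data: y depends on x1 and x2 only
set.seed(4)
n <- 500
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n), noise = rnorm(n))
sim$y <- 1 + 0.8 * sim$x1 + 0.5 * sim$x2 + rnorm(n)

null_model <- lm(y ~ 1, data = sim)
full_model <- lm(y ~ x1 + x2 + x3 + noise, data = sim)

# Forward: start from the intercept-only model and add terms up to the full scope
forward <- step(null_model,
                scope = list(lower = ~ 1, upper = ~ x1 + x2 + x3 + noise),
                direction = "forward", trace = 0)
# Backward: start from the full model and remove terms
backward <- step(full_model, direction = "backward", trace = 0)

# Hypothetical helper: in-sample fit statistics for one model
fit_stats <- function(model) {
  c(MSE = mean(residuals(model)^2),
    R2  = summary(model)$r.squared,
    AIC = AIC(model))
}
rbind(forward = fit_stats(forward), backward = fit_stats(backward))
```

In-sample MSE and R-squared can never get worse when terms are added, so when the two procedures select different models, a penalised measure such as AIC or adjusted R-squared is a fairer basis for comparison.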
## 
## Call:
## lm(formula = Esteem_PC1 ~ Intelligence + Income05 + Education05 + 
##     Income87 + Inewspaper + Job05 + Ilibrary, data = temp)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -8.393 -1.558  0.002  1.672  5.136 
## 
## Coefficients:
##                                                                                           Estimate
## (Intercept)                                                                              -2.20e+00
## Intelligence                                                                              1.24e-01
## Income05                                                                                  4.61e-06
## Education05                                                                               7.74e-02
## Income87                                                                                  1.30e-05
## Inewspaper                                                                                2.99e-01
## Job0510 TO 430: Executive, Administrative and Managerial Occupations                      5.42e-01
## Job051000 TO 1240: Mathematical and Computer Scientists                                   6.82e-01
## Job051300 TO 1560: Engineers, Architects, Surveyers, Engineering and Related Technicians -1.27e-01
## Job051600 TO 1760: Physical Scientists                                                   -1.52e+00
## Job051800 TO 1860: Social Scientists and Related Workers                                 -6.00e-01
## Job051900 TO 1960: Life, Physical and Social Science Technicians                          2.37e-01
## Job052000 TO 2060: Counselors, Sociala and Religious Workers                              3.12e-01
## Job052100 TO 2150: Lawyers, Judges and Legal Support Workers                              1.24e-01
## Job052200 TO 2340: Teachers                                                               4.28e-01
## Job052400 TO 2550: Education, Training and Library Workers                                4.43e-01
## Job052600 TO 2760: Entertainers and Performers, Sports and Related Workers                1.24e+00
## Job052800 TO 2960: Media and Communications Workers                                       5.16e-01
## Job053000 TO 3260: Health Diagnosing and Treating Practitioners                           6.32e-01
## Job053300 TO 3650: Health Care Technical and Support Occupations                         -1.80e-01
## Job053700 TO 3950: Protective Service Occupations                                         1.03e+00
## Job054000 TO 4160: Food Preparation and Serving Related Occupations                      -1.68e-01
## Job054200 TO 4250: Cleaning and Building Service Occupations                             -1.67e-01
## Job054300 TO 4430: Entertainment Attendants and Related Workers                          -1.14e+00
## Job054500 TO 4650: Personal Care and Service Workers                                      5.29e-01
## Job054700 TO 4960: Sales and Related Workers                                              4.31e-01
## Job05500 TO 950: Management Related Occupations                                           8.43e-01
## Job055000 TO 5930: Office and Administrative Support Workers                              4.92e-01
## Job056000 TO 6130: Farming, Fishing and Forestry Occupations                              1.36e-01
## Job056200 TO 6940: Construction Trade and Extraction Workers                              6.85e-02
## Job057000 TO 7620: Installation, Maintenance and Repairs Workers                          2.50e-01
## Job057700 TO 7750: Production and Operating Workers                                       2.51e-01
## Job057800 TO 7850: Food Preparation Occupations                                           4.59e-01
## Job057900 TO 8960: Setters, Operators and Tenders                                         2.94e-01
## Job059000 TO 9750: Transportation and Material Moving Workers                            -1.42e-01
## Job059990: Uncodeable                                                                     2.02e-02
## Ilibrary                                                                                  1.44e-01
##                                                                                          Std. Error
## (Intercept)                                                                                4.27e-01
## Intelligence                                                                               2.04e-02
## Income05                                                                                   1.04e-06
## Education05                                                                                2.29e-02
## Income87                                                                                   3.87e-06
## Inewspaper                                                                                 1.27e-01
## Job0510 TO 430: Executive, Administrative and Managerial Occupations                       2.91e-01
## Job051000 TO 1240: Mathematical and Computer Scientists                                    3.72e-01
## Job051300 TO 1560: Engineers, Architects, Surveyers, Engineering and Related Technicians   3.88e-01
## Job051600 TO 1760: Physical Scientists                                                     1.04e+00
## Job051800 TO 1860: Social Scientists and Related Workers                                   8.67e-01
## Job051900 TO 1960: Life, Physical and Social Science Technicians                           8.08e-01
## Job052000 TO 2060: Counselors, Sociala and Religious Workers                               4.18e-01
## Job052100 TO 2150: Lawyers, Judges and Legal Support Workers                               5.93e-01
## Job052200 TO 2340: Teachers                                                                3.35e-01
## Job052400 TO 2550: Education, Training and Library Workers                                 4.62e-01
## Job052600 TO 2760: Entertainers and Performers, Sports and Related Workers                 4.93e-01
## Job052800 TO 2960: Media and Communications Workers                                        6.21e-01
## Job053000 TO 3260: Health Diagnosing and Treating Practitioners                            3.61e-01
## Job053300 TO 3650: Health Care Technical and Support Occupations                           3.37e-01
## Job053700 TO 3950: Protective Service Occupations                                          3.84e-01
## Job054000 TO 4160: Food Preparation and Serving Related Occupations                        3.65e-01
## Job054200 TO 4250: Cleaning and Building Service Occupations                               3.66e-01
## Job054300 TO 4430: Entertainment Attendants and Related Workers                            6.91e-01
## Job054500 TO 4650: Personal Care and Service Workers                                       4.12e-01
## Job054700 TO 4960: Sales and Related Workers                                               3.04e-01
## Job05500 TO 950: Management Related Occupations                                            3.33e-01
## Job055000 TO 5930: Office and Administrative Support Workers                               2.90e-01
## Job056000 TO 6130: Farming, Fishing and Forestry Occupations                               7.26e-01
## Job056200 TO 6940: Construction Trade and Extraction Workers                               3.21e-01
## Job057000 TO 7620: Installation, Maintenance and Repairs Workers                           3.33e-01
## Job057700 TO 7750: Production and Operating Workers                                        3.94e-01
## Job057800 TO 7850: Food Preparation Occupations                                            1.04e+00
## Job057900 TO 8960: Setters, Operators and Tenders                                          3.31e-01
## Job059000 TO 9750: Transportation and Material Moving Workers                              3.28e-01
## Job059990: Uncodeable                                                                      2.03e+00
## Ilibrary                                                                                   1.02e-01
##                                                                                          t value
## (Intercept)                                                                                -5.14
## Intelligence                                                                                6.08
## Income05                                                                                    4.43
## Education05                                                                                 3.38
## Income87                                                                                    3.36
## Inewspaper                                                                                  2.34
## Job0510 TO 430: Executive, Administrative and Managerial Occupations                        1.86
## Job051000 TO 1240: Mathematical and Computer Scientists                                     1.83
## Job051300 TO 1560: Engineers, Architects, Surveyers, Engineering and Related Technicians   -0.33
## Job051600 TO 1760: Physical Scientists                                                     -1.45
## Job051800 TO 1860: Social Scientists and Related Workers                                   -0.69
## Job051900 TO 1960: Life, Physical and Social Science Technicians                            0.29
## Job052000 TO 2060: Counselors, Sociala and Religious Workers                                0.75
## Job052100 TO 2150: Lawyers, Judges and Legal Support Workers                                0.21
## Job052200 TO 2340: Teachers                                                                 1.28
## Job052400 TO 2550: Education, Training and Library Workers                                  0.96
## Job052600 TO 2760: Entertainers and Performers, Sports and Related Workers                  2.52
## Job052800 TO 2960: Media and Communications Workers                                         0.83
## Job053000 TO 3260: Health Diagnosing and Treating Practitioners                             1.75
## Job053300 TO 3650: Health Care Technical and Support Occupations                           -0.53
## Job053700 TO 3950: Protective Service Occupations                                           2.68
## Job054000 TO 4160: Food Preparation and Serving Related Occupations                        -0.46
## Job054200 TO 4250: Cleaning and Building Service Occupations                               -0.46
## Job054300 TO 4430: Entertainment Attendants and Related Workers                            -1.65
## Job054500 TO 4650: Personal Care and Service Workers                                        1.29
## Job054700 TO 4960: Sales and Related Workers                                                1.42
## Job05500 TO 950: Management Related Occupations                                             2.53
## Job055000 TO 5930: Office and Administrative Support Workers                                1.70
## Job056000 TO 6130: Farming, Fishing and Forestry Occupations                                0.19
## Job056200 TO 6940: Construction Trade and Extraction Workers                                0.21
## Job057000 TO 7620: Installation, Maintenance and Repairs Workers                            0.75
## Job057700 TO 7750: Production and Operating Workers                                         0.64
## Job057800 TO 7850: Food Preparation Occupations                                             0.44
## Job057900 TO 8960: Setters, Operators and Tenders                                           0.89
## Job059000 TO 9750: Transportation and Material Moving Workers                              -0.43
## Job059990: Uncodeable                                                                       0.01
## Ilibrary                                                                                    1.41
##                                                                                          Pr(>|t|)
## (Intercept)                                                                               3.0e-07
## Intelligence                                                                              1.4e-09
## Income05                                                                                  9.7e-06
## Education05                                                                               0.00075
## Income87                                                                                  0.00079
## Inewspaper                                                                                0.01918
## Job0510 TO 430: Executive, Administrative and Managerial Occupations                      0.06237
## Job051000 TO 1240: Mathematical and Computer Scientists                                   0.06678
## Job051300 TO 1560: Engineers, Architects, Surveyers, Engineering and Related Technicians  0.74262
## Job051600 TO 1760: Physical Scientists                                                    0.14623
## Job051800 TO 1860: Social Scientists and Related Workers                                  0.48877
## Job051900 TO 1960: Life, Physical and Social Science Technicians                          0.76896
## Job052000 TO 2060: Counselors, Sociala and Religious Workers                              0.45472
## Job052100 TO 2150: Lawyers, Judges and Legal Support Workers                              0.83490
## Job052200 TO 2340: Teachers                                                               0.20125
## Job052400 TO 2550: Education, Training and Library Workers                                0.33756
## Job052600 TO 2760: Entertainers and Performers, Sports and Related Workers                0.01188
## Job052800 TO 2960: Media and Communications Workers                                       0.40623
## Job053000 TO 3260: Health Diagnosing and Treating Practitioners                           0.08001
## Job053300 TO 3650: Health Care Technical and Support Occupations                          0.59297
## Job053700 TO 3950: Protective Service Occupations                                         0.00751
## Job054000 TO 4160: Food Preparation and Serving Related Occupations                       0.64525
## Job054200 TO 4250: Cleaning and Building Service Occupations                              0.64848
## Job054300 TO 4430: Entertainment Attendants and Related Workers                           0.09992
## Job054500 TO 4650: Personal Care and Service Workers                                      0.19870
## Job054700 TO 4960: Sales and Related Workers                                              0.15605
## Job05500 TO 950: Management Related Occupations                                           0.01151
## Job055000 TO 5930: Office and Administrative Support Workers                              0.08928
## Job056000 TO 6130: Farming, Fishing and Forestry Occupations                              0.85173
## Job056200 TO 6940: Construction Trade and Extraction Workers                              0.83113
## Job057000 TO 7620: Installation, Maintenance and Repairs Workers                          0.45227
## Job057700 TO 7750: Production and Operating Workers                                       0.52489
## Job057800 TO 7850: Food Preparation Occupations                                           0.66006
## Job057900 TO 8960: Setters, Operators and Tenders                                         0.37412
## Job059000 TO 9750: Transportation and Material Moving Workers                             0.66501
## Job059990: Uncodeable                                                                     0.99207
## Ilibrary                                                                                  0.15800
##                                                                                             
## (Intercept)                                                                              ***
## Intelligence                                                                             ***
## Income05                                                                                 ***
## Education05                                                                              ***
## Income87                                                                                 ***
## Inewspaper                                                                               *  
## Job0510 TO 430: Executive, Administrative and Managerial Occupations                     .  
## Job051000 TO 1240: Mathematical and Computer Scientists                                  .  
## Job051300 TO 1560: Engineers, Architects, Surveyers, Engineering and Related Technicians    
## Job051600 TO 1760: Physical Scientists                                                      
## Job051800 TO 1860: Social Scientists and Related Workers                                    
## Job051900 TO 1960: Life, Physical and Social Science Technicians                            
## Job052000 TO 2060: Counselors, Sociala and Religious Workers                                
## Job052100 TO 2150: Lawyers, Judges and Legal Support Workers                                
## Job052200 TO 2340: Teachers                                                                 
## Job052400 TO 2550: Education, Training and Library Workers                                  
## Job052600 TO 2760: Entertainers and Performers, Sports and Related Workers               *  
## Job052800 TO 2960: Media and Communications Workers                                         
## Job053000 TO 3260: Health Diagnosing and Treating Practitioners                          .  
## Job053300 TO 3650: Health Care Technical and Support Occupations                            
## Job053700 TO 3950: Protective Service Occupations                                        ** 
## Job054000 TO 4160: Food Preparation and Serving Related Occupations                         
## Job054200 TO 4250: Cleaning and Building Service Occupations                                
## Job054300 TO 4430: Entertainment Attendants and Related Workers                          .  
## Job054500 TO 4650: Personal Care and Service Workers                                        
## Job054700 TO 4960: Sales and Related Workers                                                
## Job05500 TO 950: Management Related Occupations                                          *  
## Job055000 TO 5930: Office and Administrative Support Workers                             .  
## Job056000 TO 6130: Farming, Fishing and Forestry Occupations                                
## Job056200 TO 6940: Construction Trade and Extraction Workers                                
## Job057000 TO 7620: Installation, Maintenance and Repairs Workers                            
## Job057700 TO 7750: Production and Operating Workers                                         
## Job057800 TO 7850: Food Preparation Occupations                                             
## Job057900 TO 8960: Setters, Operators and Tenders                                           
## Job059000 TO 9750: Transportation and Material Moving Workers                               
## Job059990: Uncodeable                                                                       
## Ilibrary                                                                                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.01 on 2394 degrees of freedom
## Multiple R-squared:  0.15,   Adjusted R-squared:  0.137 
## F-statistic: 11.7 on 36 and 2394 DF,  p-value: <2e-16
## Esteem_PC1 ~ Intelligence + Income05 + Education05 + Income87 + 
##     Inewspaper + Job05 + Ilibrary
This process arrived at exactly the same model. Showing only the intercept and the terms that are significant at the 0.05 level, it is: Esteem_PC1 = -2.20 + 0.0774(Education05) + 0.000013(Income87) + 1.24(Entertainers and Performers, Sports and Related Workers) + 1.03(Protective Service Occupations) + 0.843(Management Related Occupations) + 0.00000461(Income05) + 0.299(Inewspaper) + 0.124(Intelligence). In this equation we keep only the occupations whose coefficients differ significantly from the baseline job, which simplifies the presentation; the fitted model itself still contains every Job05 level and Ilibrary, as the output above shows.

Similarly, this model has an R-squared of 0.15, meaning that 15% of the variation in Esteem_PC1 is explained by the independent variables. The F-statistic of 11.7 has a corresponding p-value below 2x10^-16, indicating that the model as a whole is significant in explaining variation in the dependent variable. Finally, the residual standard error is 2.01, meaning that observed values of Esteem_PC1 typically deviate from the fitted values by about 2.01 units of Esteem_PC1.
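
To make the three summary statistics concrete, here is a minimal sketch of how they relate to the residuals of a fitted `lm`. It uses simulated data with hypothetical variable names (`x1`, `x2`, `y`), not the NLSY79 variables, and is only meant to illustrate the definitions.

```r
# Simulated data; the variable names are hypothetical stand-ins.
set.seed(1)
n  <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 1 + 0.5 * x1 - 0.3 * x2 + rnorm(n, sd = 2)
fit <- lm(y ~ x1 + x2)

res <- residuals(fit)
p   <- length(coef(fit)) - 1           # number of predictors, excluding the intercept
rss <- sum(res^2)                      # residual sum of squares
tss <- sum((y - mean(y))^2)            # total sum of squares

r_squared <- 1 - rss / tss             # share of variation explained
rse       <- sqrt(rss / (n - p - 1))   # residual standard error, in the units of y
f_stat    <- ((tss - rss) / p) / (rss / (n - p - 1))

s <- summary(fit)                      # reports the same three quantities
c(r_squared = r_squared, rse = rse, f_stat = f_stat)
```

The residual standard error is on the scale of the response, which is why it is read as a typical residual size rather than as a number of standard deviations.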

Finally, we will conduct an exhaustive search, which involves fitting a regression model for every possible combination of independent variables and selecting the one with the lowest AIC.
## 
## Call:
## lm(formula = best_formula, data = df_exh)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -8.910 -1.617  0.034  1.680  4.791 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -2.38e+00   3.27e-01   -7.29  4.3e-13 ***
## Education05   9.46e-02   2.08e-02    4.55  5.7e-06 ***
## Income87      1.37e-05   3.85e-06    3.56  0.00038 ***
## Income05      4.81e-06   1.00e-06    4.81  1.6e-06 ***
## Inewspaper    2.50e-01   1.29e-01    1.94  0.05193 .  
## Ilibrary      1.56e-01   1.03e-01    1.52  0.12921    
## MotherEd      2.65e-02   1.86e-02    1.42  0.15461    
## Intelligence  1.28e-01   2.05e-02    6.21  6.2e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.03 on 2423 degrees of freedom
## Multiple R-squared:  0.128,  Adjusted R-squared:  0.126 
## F-statistic: 50.9 on 7 and 2423 DF,  p-value: <2e-16
The most efficient model found by this process is: Esteem_PC1 = -2.38 + 0.0946(Education05) + 0.0000137(Income87) + 0.00000481(Income05) + 0.25(Inewspaper) + 0.156(Ilibrary) + 0.0265(MotherEd) + 0.128(Intelligence).

Similarly, this model has an R-squared of 0.128, meaning that 12.8% of the variation in Esteem_PC1 is explained by the independent variables. The F-statistic of 50.9 has a corresponding p-value below 2x10^-16, indicating that the model as a whole is significant in explaining variation in the dependent variable. Finally, the residual standard error is 2.03, meaning that observed values of Esteem_PC1 typically deviate from the fitted values by about 2.03 units of Esteem_PC1.
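
The exhaustive search described above can be sketched in base R as follows. This is an illustration under assumptions, not the code that produced the output: it uses the built-in `mtcars` data and a hypothetical candidate set rather than the NLSY79 variables, and it enumerates subsets by hand instead of using a dedicated package.

```r
# Hypothetical example on the built-in mtcars data (not the NLSY79 variables).
response   <- "mpg"
candidates <- c("wt", "hp", "qsec", "drat")

# Enumerate every non-empty subset of the candidate predictors.
subsets <- unlist(
  lapply(seq_along(candidates),
         function(k) combn(candidates, k, simplify = FALSE)),
  recursive = FALSE
)

# Fit one linear model per subset and record its AIC.
aic_values <- vapply(subsets, function(vars) {
  AIC(lm(reformulate(vars, response = response), data = mtcars))
}, numeric(1))

# Keep the subset with the lowest AIC and refit it.
best_vars    <- subsets[[which.min(aic_values)]]
best_formula <- reformulate(best_vars, response = response)
best_fit     <- lm(best_formula, data = mtcars)
summary(best_fit)
```

With k candidate predictors there are 2^k - 1 non-empty subsets, so this brute-force approach is only practical for a small candidate set.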

Hence, comparing these models, we select the forward / backward stepwise regression model because it has a higher R-squared (0.15 vs. 0.128) and a higher adjusted R-squared (0.137 vs. 0.126), while being similar to the exhaustive model in RSE and in the significance of the F-statistic.

To test the normality assumption, we look at the QQ-plot. The points lie very close to the diagonal line, with only slight curvature at the lower and upper tails, so we can assume that the normality assumption is met. For linearity, we look at the Residuals vs. Fitted plot, where we want an even, random distribution of points above and below zero. The points are evenly distributed for fitted values below -1, but beyond this there is a clear, linear decrease, with the points converging toward zero as the fitted values become more positive. As such, we cannot conclude that the linearity assumption holds. Finally, for homoskedasticity, we look at the vertical spread of points in the same Residuals vs. Fitted plot. The spread is not even, because it narrows toward zero as the fitted values become more positive. Consequently, this assumption does not hold.

Thus, the normality assumption holds, but the linearity and homoskedasticity assumptions do not hold.
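
For reference, the two diagnostic plots discussed above can be drawn from any fitted `lm` object with base R. The sketch below assumes a hypothetical model on the built-in `mtcars` data rather than the final NLSY79 model.

```r
# Hypothetical model on built-in data, standing in for the final model.
fit <- lm(mpg ~ wt + hp, data = mtcars)

op <- par(mfrow = c(1, 2))   # two panels side by side
plot(fit, which = 1)         # Residuals vs Fitted: linearity and equal variance
plot(fit, which = 2)         # Normal Q-Q: normality of the residuals
par(op)                      # restore the previous plotting layout
```

In the first panel, a trend in the points suggests non-linearity and a changing vertical spread suggests heteroskedasticity; in the second, departures from the diagonal suggest non-normal residuals.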

Looking at the final model, we can conclude that the variables most strongly associated with one's self-esteem are Education05; Income87; Entertainers and Performers, Sports and Related Workers; Protective Service Occupations; Inewspaper; Intelligence; Income05; and Management Related Occupations. Thus, holding all other independent variables constant, for every one-unit increase in:
  - Education05, Self-Esteem will increase by 0.0774 on average.
  - Income87, Self-Esteem will increase by 0.000013 on average.
  - Income05, Self-Esteem will increase by 0.00000461 on average.
  - Inewspaper, Self-Esteem will increase by 0.299 on average.
  - Intelligence, Self-Esteem will increase by 0.124 on average.

Similarly, relative to the baseline job and holding the other variables constant, participants in the following jobs had a higher Self-Esteem on average, by:
  - 1.24 for Entertainers and Performers, Sports and Related Workers
  - 1.03 for Protective Service Occupations
  - 0.843 for Management Related Occupations
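
Because the income coefficients are expressed per dollar, they look tiny next to the others. A quick arithmetic sketch puts them on a more readable scale; the changes used here (4 more years of education, 10,000 more dollars of income) are illustrative assumptions, not values from the analysis.

```r
# Coefficients taken from the final model reported above.
coefs <- c(Education05 = 0.0774, Income87 = 0.000013, Income05 = 0.00000461)

# Illustrative changes in each predictor (assumed for this example only).
changes <- c(Education05 = 4, Income87 = 10000, Income05 = 10000)

# Expected change in Esteem_PC1, holding the other variables constant.
effect <- coefs * changes
effect
```

On this scale the three effects are 0.3096, 0.13 and 0.0461 respectively, which makes them easier to compare with the job coefficients.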
    

2 Case study 2: Breast cancer sub-type

The Cancer Genome Atlas (TCGA), a landmark cancer genomics program by the National Cancer Institute (NCI), molecularly characterized over 20,000 primary cancer and matched normal samples spanning 33 cancer types. The genome data is open to the public from the Genomic Data Commons Data Portal (GDC).

In this study, we focus on 4 sub-types of breast cancer (BRCA): basal-like (basal), Luminal A-like (lumA), Luminal B-like (lumB), and HER2-enriched. The sub-type is based on PAM50, a clinical-grade luminal-basal classifier. (We had hoped to download the data for control groups for each type of cancer, but failed to do so. Please let us know if you find the appropriate data.)

  • Luminal A cancers are low-grade, tend to grow slowly and have the best prognosis.
  • Luminal B cancers generally grow slightly faster than luminal A cancers and their prognosis is slightly worse.
  • HER2-enriched cancers tend to grow faster than luminal cancers and can have a worse prognosis, but they are often successfully treated with targeted therapies aimed at the HER2 protein.
  • Basal-like breast cancers or triple negative breast cancers do not have the three receptors that the other sub-types have, and so have fewer treatment options.

We will try to use mRNA expression data alone, without the labels, to classify the 4 sub-types. Classification without labels, or prediction without outcomes, is called unsupervised learning. We will use K-means and spectrum clustering to cluster the mRNA data and see whether the sub-types can be separated through mRNA data.

We first read the data using data.table::fread() which is a faster way to read in big data than read.csv().

  1. Summary and transformation

    1. How many patients are there in each sub-type?

    2. Randomly pick 5 genes and plot the histogram by each sub-type.

    3. Clean and transform the mRNA sequences by first removing genes with zero count or no variability, and then applying a logarithmic transform.

  2. Apply kmeans on the transformed dataset with 4 centers (4 clusters) and output the discrepancy table between the real sub-type brca_subtype and the cluster labels.

  3. Spectrum clustering: to scale or not to scale?

    1. Apply PCA on the centered and scaled dataset. How many PCs should we use and why? You are encouraged to use irlba::irlba(). To do so, please review the section about SVD in the PCA module.

    2. Plot PC1 vs PC2 of the centered and scaled data and PC1 vs PC2 of the centered but unscaled data side by side. Should we scale or not scale for the clustering process? Why?

  4. Spectrum clustering: center but do not scale the data

    1. Use the first 4 PCs of the centered and unscaled data and apply kmeans. Find a reasonable number of clusters using the within-cluster sum of squares with the elbow rule.

    2. Choose an optimal cluster number and apply kmeans. Compare the real sub-type and the clustering label as follows: Plot a scatter plot of PC1 vs PC2. Use point color to indicate the true cancer type and point shape to indicate the clustering label. Plot the kmeans centroids with black dots. Summarize how good the clustering results are compared to the real sub-type.

    3. Compare the clustering result from applying kmeans to the original data and the clustering result from applying kmeans to 4 PCs. Does PCA help in kmeans clustering? What might be the reasons if PCA helps?

    4. Now we have a patient x with breast cancer but with unknown sub-type. We have this patient’s mRNA sequencing data. Project this patient onto the space of PC1 and PC2. (Hint: remember that we removed genes with no counts or no variability, took the log and centered; then find the patient’s PC1 to PC4 scores.) Plot this patient in the plot in b) with a black dot as well. Calculate the Euclidean distance between this patient and each of the cluster centroids. (Don’t forget that the clusters were obtained using 4 PCs.) Can you tell which sub-type this patient might have?
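
The steps above (filter, log-transform, PCA on centered but unscaled data, k-means on 4 PCs, projection of a new patient) can be sketched as follows. This is a minimal illustration on simulated counts, not the TCGA data: the matrix sizes, the `+ 1` offset inside the log, and all object names are assumptions made for the example.

```r
set.seed(10)
# Simulated stand-in for the mRNA counts: 60 patients x 50 genes, 4 groups.
n <- 60; p <- 50
subtype <- rep(c("basal", "her2", "lumA", "lumB"), each = n / 4)
shift   <- matrix(rnorm(4 * p), nrow = 4)            # group-specific log-scale means
group_shift <- shift[as.integer(factor(subtype)), ]  # one row of shifts per patient
counts  <- matrix(rpois(n * p, lambda = exp(3 + group_shift)), nrow = n)
counts[, 1] <- 0                                     # a gene with zero counts everywhere

# 1. Drop genes with zero total count or zero variance, then log-transform.
keep  <- colSums(counts) > 0 & apply(counts, 2, var) > 0
x_log <- log2(counts[, keep] + 1)                    # the +1 offset is an assumption

# 2. PCA on centered, unscaled data; keep the first 4 PC scores.
pca    <- prcomp(x_log, center = TRUE, scale. = FALSE)
scores <- pca$x[, 1:4]

# 3. Elbow rule: total within-cluster sum of squares for k = 1..8.
wss <- sapply(1:8, function(k) kmeans(scores, centers = k, nstart = 20)$tot.withinss)

# 4. k-means with 4 centers, cross-tabulated against the known labels.
km <- kmeans(scores, centers = 4, nstart = 20)
table(subtype, cluster = km$cluster)

# 5. Project a new patient with the same gene filter, transform and centering.
x_new     <- rpois(p, lambda = exp(3 + shift[1, ]))
x_new_log <- log2(x_new[keep] + 1)
new_score <- drop((x_new_log - pca$center) %*% pca$rotation[, 1:4])

# 6. Euclidean distance from the new patient to each centroid in PC1-PC4 space.
dists <- sqrt(rowSums(sweep(km$centers, 2, new_score)^2))
nearest_cluster <- which.min(dists)
```

The key design point is that the new patient must go through exactly the same gene filter, log transform and centering vector as the training data before being multiplied by the loadings; otherwise its scores are not comparable with the centroids.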

3 Case Study 3: Fuel Efficiency in Automobiles

What determines how fuel efficient a car is? Are Japanese cars more fuel efficient? To answer these questions we will build various linear models using the Auto dataset from the book ISLR. The original dataset contains information for about 400 different cars built in various years. To get the data, first install the package ISLR, which has been done in the first R-chunk. The Auto dataset should be loaded automatically. The original data source is here: https://archive.ics.uci.edu/ml/datasets/auto+mpg

Get familiar with this dataset first. Tip: you can use the command ?ISLR::Auto to view a description of the dataset. Our response variable will be MPG: miles per gallon.

3.1 EDA

##       mpg         cylinders     displacement   horsepower        weight    
##  Min.   : 9.0   Min.   :3.00   Min.   : 68   Min.   : 46.0   Min.   :1613  
##  1st Qu.:17.0   1st Qu.:4.00   1st Qu.:105   1st Qu.: 75.0   1st Qu.:2225  
##  Median :22.8   Median :4.00   Median :151   Median : 93.5   Median :2804  
##  Mean   :23.4   Mean   :5.47   Mean   :194   Mean   :104.5   Mean   :2978  
##  3rd Qu.:29.0   3rd Qu.:8.00   3rd Qu.:276   3rd Qu.:126.0   3rd Qu.:3615  
##  Max.   :46.6   Max.   :8.00   Max.   :455   Max.   :230.0   Max.   :5140  
##                                                                            
##   acceleration       year        origin                     name    
##  Min.   : 8.0   Min.   :70   Min.   :1.00   amc matador       :  5  
##  1st Qu.:13.8   1st Qu.:73   1st Qu.:1.00   ford pinto        :  5  
##  Median :15.5   Median :76   Median :1.00   toyota corolla    :  5  
##  Mean   :15.5   Mean   :76   Mean   :1.58   amc gremlin       :  4  
##  3rd Qu.:17.0   3rd Qu.:79   3rd Qu.:2.00   amc hornet        :  4  
##  Max.   :24.8   Max.   :82   Max.   :3.00   chevrolet chevette:  4  
##                                             (Other)           :365

##      mpg cylinders displacement horsepower weight acceleration year origin
## 1   18.0         8        307.0        130   3504         12.0   70      1
## 2   15.0         8        350.0        165   3693         11.5   70      1
## 3   18.0         8        318.0        150   3436         11.0   70      1
## 4   16.0         8        304.0        150   3433         12.0   70      1
## 5   17.0         8        302.0        140   3449         10.5   70      1
## 6   15.0         8        429.0        198   4341         10.0   70      1
## 7   14.0         8        454.0        220   4354          9.0   70      1
## 8   14.0         8        440.0        215   4312          8.5   70      1
## 9   14.0         8        455.0        225   4425         10.0   70      1
## 10  15.0         8        390.0        190   3850          8.5   70      1
## 11  15.0         8        383.0        170   3563         10.0   70      1
## 12  14.0         8        340.0        160   3609          8.0   70      1
## 13  15.0         8        400.0        150   3761          9.5   70      1
## 14  14.0         8        455.0        225   3086         10.0   70      1
## 15  24.0         4        113.0         95   2372         15.0   70      3
## 16  22.0         6        198.0         95   2833         15.5   70      1
## 17  18.0         6        199.0         97   2774         15.5   70      1
## 18  21.0         6        200.0         85   2587         16.0   70      1
## 19  27.0         4         97.0         88   2130         14.5   70      3
## 20  26.0         4         97.0         46   1835         20.5   70      2
## 21  25.0         4        110.0         87   2672         17.5   70      2
## 22  24.0         4        107.0         90   2430         14.5   70      2
## 23  25.0         4        104.0         95   2375         17.5   70      2
## 24  26.0         4        121.0        113   2234         12.5   70      2
## 25  21.0         6        199.0         90   2648         15.0   70      1
## 26  10.0         8        360.0        215   4615         14.0   70      1
## 27  10.0         8        307.0        200   4376         15.0   70      1
## 28  11.0         8        318.0        210   4382         13.5   70      1
## 29   9.0         8        304.0        193   4732         18.5   70      1
## 30  27.0         4         97.0         88   2130         14.5   71      3
## 31  28.0         4        140.0         90   2264         15.5   71      1
## 32  25.0         4        113.0         95   2228         14.0   71      3
## 34  19.0         6        232.0        100   2634         13.0   71      1
## 35  16.0         6        225.0        105   3439         15.5   71      1
## 36  17.0         6        250.0        100   3329         15.5   71      1
## 37  19.0         6        250.0         88   3302         15.5   71      1
## 38  18.0         6        232.0        100   3288         15.5   71      1
## 39  14.0         8        350.0        165   4209         12.0   71      1
## 40  14.0         8        400.0        175   4464         11.5   71      1
## 41  14.0         8        351.0        153   4154         13.5   71      1
## 42  14.0         8        318.0        150   4096         13.0   71      1
## 43  12.0         8        383.0        180   4955         11.5   71      1
## 44  13.0         8        400.0        170   4746         12.0   71      1
## 45  13.0         8        400.0        175   5140         12.0   71      1
## 46  18.0         6        258.0        110   2962         13.5   71      1
## 47  22.0         4        140.0         72   2408         19.0   71      1
## 48  19.0         6        250.0        100   3282         15.0   71      1
## 49  18.0         6        250.0         88   3139         14.5   71      1
## 50  23.0         4        122.0         86   2220         14.0   71      1
## 51  28.0         4        116.0         90   2123         14.0   71      2
## 52  30.0         4         79.0         70   2074         19.5   71      2
## 53  30.0         4         88.0         76   2065         14.5   71      2
## 54  31.0         4         71.0         65   1773         19.0   71      3
## 55  35.0         4         72.0         69   1613         18.0   71      3
## 56  27.0         4         97.0         60   1834         19.0   71      2
## 57  26.0         4         91.0         70   1955         20.5   71      1
## 58  24.0         4        113.0         95   2278         15.5   72      3
## 59  25.0         4         97.5         80   2126         17.0   72      1
## 60  23.0         4         97.0         54   2254         23.5   72      2
## 61  20.0         4        140.0         90   2408         19.5   72      1
## 62  21.0         4        122.0         86   2226         16.5   72      1
## 63  13.0         8        350.0        165   4274         12.0   72      1
## 64  14.0         8        400.0        175   4385         12.0   72      1
## 65  15.0         8        318.0        150   4135         13.5   72      1
## 66  14.0         8        351.0        153   4129         13.0   72      1
## 67  17.0         8        304.0        150   3672         11.5   72      1
## 68  11.0         8        429.0        208   4633         11.0   72      1
## 69  13.0         8        350.0        155   4502         13.5   72      1
## 70  12.0         8        350.0        160   4456         13.5   72      1
## 71  13.0         8        400.0        190   4422         12.5   72      1
## 72  19.0         3         70.0         97   2330         13.5   72      3
## 73  15.0         8        304.0        150   3892         12.5   72      1
## 74  13.0         8        307.0        130   4098         14.0   72      1
## 75  13.0         8        302.0        140   4294         16.0   72      1
## 76  14.0         8        318.0        150   4077         14.0   72      1
## 77  18.0         4        121.0        112   2933         14.5   72      2
## 78  22.0         4        121.0         76   2511         18.0   72      2
## 79  21.0         4        120.0         87   2979         19.5   72      2
## 80  26.0         4         96.0         69   2189         18.0   72      2
## 81  22.0         4        122.0         86   2395         16.0   72      1
## 82  28.0         4         97.0         92   2288         17.0   72      3
## 83  23.0         4        120.0         97   2506         14.5   72      3
## 84  28.0         4         98.0         80   2164         15.0   72      1
## 85  27.0         4         97.0         88   2100         16.5   72      3
## 86  13.0         8        350.0        175   4100         13.0   73      1
## 87  14.0         8        304.0        150   3672         11.5   73      1
## 88  13.0         8        350.0        145   3988         13.0   73      1
## 89  14.0         8        302.0        137   4042         14.5   73      1
## 90  15.0         8        318.0        150   3777         12.5   73      1
## 91  12.0         8        429.0        198   4952         11.5   73      1
## 92  13.0         8        400.0        150   4464         12.0   73      1
## 93  13.0         8        351.0        158   4363         13.0   73      1
## 94  14.0         8        318.0        150   4237         14.5   73      1
## 95  13.0         8        440.0        215   4735         11.0   73      1
## 96  12.0         8        455.0        225   4951         11.0   73      1
## 97  13.0         8        360.0        175   3821         11.0   73      1
## 98  18.0         6        225.0        105   3121         16.5   73      1
## 99  16.0         6        250.0        100   3278         18.0   73      1
## 100 18.0         6        232.0        100   2945         16.0   73      1
## 101 18.0         6        250.0         88   3021         16.5   73      1
## 102 23.0         6        198.0         95   2904         16.0   73      1
## 103 26.0         4         97.0         46   1950         21.0   73      2
## 104 11.0         8        400.0        150   4997         14.0   73      1
## 105 12.0         8        400.0        167   4906         12.5   73      1
## 106 13.0         8        360.0        170   4654         13.0   73      1
## 107 12.0         8        350.0        180   4499         12.5   73      1
## 108 18.0         6        232.0        100   2789         15.0   73      1
## 109 20.0         4         97.0         88   2279         19.0   73      3
## 110 21.0         4        140.0         72   2401         19.5   73      1
## 111 22.0         4        108.0         94   2379         16.5   73      3
## 112 18.0         3         70.0         90   2124         13.5   73      3
## 113 19.0         4        122.0         85   2310         18.5   73      1
## 114 21.0         6        155.0        107   2472         14.0   73      1
## 115 26.0         4         98.0         90   2265         15.5   73      2
## 116 15.0         8        350.0        145   4082         13.0   73      1
## 117 16.0         8        400.0        230   4278          9.5   73      1
## 118 29.0         4         68.0         49   1867         19.5   73      2
## 119 24.0         4        116.0         75   2158         15.5   73      2
## 120 20.0         4        114.0         91   2582         14.0   73      2
## 121 19.0         4        121.0        112   2868         15.5   73      2
## 122 15.0         8        318.0        150   3399         11.0   73      1
## 123 24.0         4        121.0        110   2660         14.0   73      2
## 124 20.0         6        156.0        122   2807         13.5   73      3
## 125 11.0         8        350.0        180   3664         11.0   73      1
## 126 20.0         6        198.0         95   3102         16.5   74      1
## 128 19.0         6        232.0        100   2901         16.0   74      1
## 129 15.0         6        250.0        100   3336         17.0   74      1
## 130 31.0         4         79.0         67   1950         19.0   74      3
## 131 26.0         4        122.0         80   2451         16.5   74      1
## 132 32.0         4         71.0         65   1836         21.0   74      3
## 133 25.0         4        140.0         75   2542         17.0   74      1
## 134 16.0         6        250.0        100   3781         17.0   74      1
## 135 16.0         6        258.0        110   3632         18.0   74      1
## 136 18.0         6        225.0        105   3613         16.5   74      1
## 137 16.0         8        302.0        140   4141         14.0   74      1
## 138 13.0         8        350.0        150   4699         14.5   74      1
## 139 14.0         8        318.0        150   4457         13.5   74      1
## 140 14.0         8        302.0        140   4638         16.0   74      1
## 141 14.0         8        304.0        150   4257         15.5   74      1
## 142 29.0         4         98.0         83   2219         16.5   74      2
## 143 26.0         4         79.0         67   1963         15.5   74      2
## 144 26.0         4         97.0         78   2300         14.5   74      2
## 145 31.0         4         76.0         52   1649         16.5   74      3
## 146 32.0         4         83.0         61   2003         19.0   74      3
## 147 28.0         4         90.0         75   2125         14.5   74      1
## 148 24.0         4         90.0         75   2108         15.5   74      2
## 149 26.0         4        116.0         75   2246         14.0   74      2
## 150 24.0         4        120.0         97   2489         15.0   74      3
## 151 26.0         4        108.0         93   2391         15.5   74      3
## 152 31.0         4         79.0         67   2000         16.0   74      2
## 153 19.0         6        225.0         95   3264         16.0   75      1
## 154 18.0         6        250.0        105   3459         16.0   75      1
## 155 15.0         6        250.0         72   3432         21.0   75      1
## 156 15.0         6        250.0         72   3158         19.5   75      1
## 157 16.0         8        400.0        170   4668         11.5   75      1
## 158 15.0         8        350.0        145   4440         14.0   75      1
## 159 16.0         8        318.0        150   4498         14.5   75      1
## 160 14.0         8        351.0        148   4657         13.5   75      1
## 161 17.0         6        231.0        110   3907         21.0   75      1
## 162 16.0         6        250.0        105   3897         18.5   75      1
## 163 15.0         6        258.0        110   3730         19.0   75      1
## 164 18.0         6        225.0         95   3785         19.0   75      1
## 165 21.0         6        231.0        110   3039         15.0   75      1
## 166 20.0         8        262.0        110   3221         13.5   75      1
## 167 13.0         8        302.0        129   3169         12.0   75      1
## 168 29.0         4         97.0         75   2171         16.0   75      3
## 169 23.0         4        140.0         83   2639         17.0   75      1
## 170 20.0         6        232.0        100   2914         16.0   75      1
## 171 23.0         4        140.0         78   2592         18.5   75      1
## 172 24.0         4        134.0         96   2702         13.5   75      3
## 173 25.0         4         90.0         71   2223         16.5   75      2
## 174 24.0         4        119.0         97   2545         17.0   75      3
## 175 18.0         6        171.0         97   2984         14.5   75      1
## 176 29.0         4         90.0         70   1937         14.0   75      2
## 177 19.0         6        232.0         90   3211         17.0   75      1
## 178 23.0         4        115.0         95   2694         15.0   75      2
## 179 23.0         4        120.0         88   2957         17.0   75      2
## 180 22.0         4        121.0         98   2945         14.5   75      2
## 181 25.0         4        121.0        115   2671         13.5   75      2
## 182 33.0         4         91.0         53   1795         17.5   75      3
## 183 28.0         4        107.0         86   2464         15.5   76      2
## 184 25.0         4        116.0         81   2220         16.9   76      2
## 185 25.0         4        140.0         92   2572         14.9   76      1
## 186 26.0         4         98.0         79   2255         17.7   76      1
## 187 27.0         4        101.0         83   2202         15.3   76      2
## 188 17.5         8        305.0        140   4215         13.0   76      1
## 189 16.0         8        318.0        150   4190         13.0   76      1
## 190 15.5         8        304.0        120   3962         13.9   76      1
## 191 14.5         8        351.0        152   4215         12.8   76      1
## 192 22.0         6        225.0        100   3233         15.4   76      1
## 193 22.0         6        250.0        105   3353         14.5   76      1
## 194 24.0         6        200.0         81   3012         17.6   76      1
## 195 22.5         6        232.0         90   3085         17.6   76      1
## 196 29.0         4         85.0         52   2035         22.2   76      1
## 197 24.5         4         98.0         60   2164         22.1   76      1
## 198 29.0         4         90.0         70   1937         14.2   76      2
## 199 33.0         4         91.0         53   1795         17.4   76      3
## 200 20.0         6        225.0        100   3651         17.7   76      1
## 201 18.0         6        250.0         78   3574         21.0   76      1
## 202 18.5         6        250.0        110   3645         16.2   76      1
## 203 17.5         6        258.0         95   3193         17.8   76      1
## 204 29.5         4         97.0         71   1825         12.2   76      2
## 205 32.0         4         85.0         70   1990         17.0   76      3
## 206 28.0         4         97.0         75   2155         16.4   76      3
## 207 26.5         4        140.0         72   2565         13.6   76      1
## 208 20.0         4        130.0        102   3150         15.7   76      2
## 209 13.0         8        318.0        150   3940         13.2   76      1
## 210 19.0         4        120.0         88   3270         21.9   76      2
## 211 19.0         6        156.0        108   2930         15.5   76      3
## 212 16.5         6        168.0        120   3820         16.7   76      2
## 213 16.5         8        350.0        180   4380         12.1   76      1
## 214 13.0         8        350.0        145   4055         12.0   76      1
## 215 13.0         8        302.0        130   3870         15.0   76      1
## 216 13.0         8        318.0        150   3755         14.0   76      1
## 217 31.5         4         98.0         68   2045         18.5   77      3
## 218 30.0         4        111.0         80   2155         14.8   77      1
## 219 36.0         4         79.0         58   1825         18.6   77      2
## 220 25.5         4        122.0         96   2300         15.5   77      1
## 221 33.5         4         85.0         70   1945         16.8   77      3
## 222 17.5         8        305.0        145   3880         12.5   77      1
## 223 17.0         8        260.0        110   4060         19.0   77      1
## 224 15.5         8        318.0        145   4140         13.7   77      1
## 225 15.0         8        302.0        130   4295         14.9   77      1
## 226 17.5         6        250.0        110   3520         16.4   77      1
## 227 20.5         6        231.0        105   3425         16.9   77      1
## 228 19.0         6        225.0        100   3630         17.7   77      1
## 229 18.5         6        250.0         98   3525         19.0   77      1
## 230 16.0         8        400.0        180   4220         11.1   77      1
## 231 15.5         8        350.0        170   4165         11.4   77      1
## 232 15.5         8        400.0        190   4325         12.2   77      1
## 233 16.0         8        351.0        149   4335         14.5   77      1
## 234 29.0         4         97.0         78   1940         14.5   77      2
## 235 24.5         4        151.0         88   2740         16.0   77      1
## 236 26.0         4         97.0         75   2265         18.2   77      3
## 237 25.5         4        140.0         89   2755         15.8   77      1
## 238 30.5         4         98.0         63   2051         17.0   77      1
## 239 33.5         4         98.0         83   2075         15.9   77      1
## 240 30.0         4         97.0         67   1985         16.4   77      3
## 241 30.5         4         97.0         78   2190         14.1   77      2
## 242 22.0         6        146.0         97   2815         14.5   77      3
## 243 21.5         4        121.0        110   2600         12.8   77      2
## 244 21.5         3         80.0        110   2720         13.5   77      3
## 245 43.1         4         90.0         48   1985         21.5   78      2
## 246 36.1         4         98.0         66   1800         14.4   78      1
## 247 32.8         4         78.0         52   1985         19.4   78      3
## 248 39.4         4         85.0         70   2070         18.6   78      3
## 249 36.1         4         91.0         60   1800         16.4   78      3
## 250 19.9         8        260.0        110   3365         15.5   78      1
## 251 19.4         8        318.0        140   3735         13.2   78      1
## 252 20.2         8        302.0        139   3570         12.8   78      1
## 253 19.2         6        231.0        105   3535         19.2   78      1
## 254 20.5         6        200.0         95   3155         18.2   78      1
## 255 20.2         6        200.0         85   2965         15.8   78      1
## 256 25.1         4        140.0         88   2720         15.4   78      1
## 257 20.5         6        225.0        100   3430         17.2   78      1
## 258 19.4         6        232.0         90   3210         17.2   78      1
## 259 20.6         6        231.0        105   3380         15.8   78      1
## 260 20.8         6        200.0         85   3070         16.7   78      1
## 261 18.6         6        225.0        110   3620         18.7   78      1
## 262 18.1         6        258.0        120   3410         15.1   78      1
## 263 19.2         8        305.0        145   3425         13.2   78      1
## 264 17.7         6        231.0        165   3445         13.4   78      1
## 265 18.1         8        302.0        139   3205         11.2   78      1
## 266 17.5         8        318.0        140   4080         13.7   78      1
## 267 30.0         4         98.0         68   2155         16.5   78      1
## 268 27.5         4        134.0         95   2560         14.2   78      3
## 269 27.2         4        119.0         97   2300         14.7   78      3
## 270 30.9         4        105.0         75   2230         14.5   78      1
## 271 21.1         4        134.0         95   2515         14.8   78      3
## 272 23.2         4        156.0        105   2745         16.7   78      1
## 273 23.8         4        151.0         85   2855         17.6   78      1
## 274 23.9         4        119.0         97   2405         14.9   78      3
## 275 20.3         5        131.0        103   2830         15.9   78      2
## 276 17.0         6        163.0        125   3140         13.6   78      2
## 277 21.6         4        121.0        115   2795         15.7   78      2
## 278 16.2         6        163.0        133   3410         15.8   78      2
## 279 31.5         4         89.0         71   1990         14.9   78      2
## 280 29.5         4         98.0         68   2135         16.6   78      3
## 281 21.5         6        231.0        115   3245         15.4   79      1
## 282 19.8         6        200.0         85   2990         18.2   79      1
## 283 22.3         4        140.0         88   2890         17.3   79      1
## 284 20.2         6        232.0         90   3265         18.2   79      1
## 285 20.6         6        225.0        110   3360         16.6   79      1
## 286 17.0         8        305.0        130   3840         15.4   79      1
## 287 17.6         8        302.0        129   3725         13.4   79      1
## 288 16.5         8        351.0        138   3955         13.2   79      1
## 289 18.2         8        318.0        135   3830         15.2   79      1
## 290 16.9         8        350.0        155   4360         14.9   79      1
## 291 15.5         8        351.0        142   4054         14.3   79      1
## 292 19.2         8        267.0        125   3605         15.0   79      1
## 293 18.5         8        360.0        150   3940         13.0   79      1
## 294 31.9         4         89.0         71   1925         14.0   79      2
## 295 34.1         4         86.0         65   1975         15.2   79      3
## 296 35.7         4         98.0         80   1915         14.4   79      1
## 297 27.4         4        121.0         80   2670         15.0   79      1
## 298 25.4         5        183.0         77   3530         20.1   79      2
## 299 23.0         8        350.0        125   3900         17.4   79      1
## 300 27.2         4        141.0         71   3190         24.8   79      2
## 301 23.9         8        260.0         90   3420         22.2   79      1
## 302 34.2         4        105.0         70   2200         13.2   79      1
## 303 34.5         4        105.0         70   2150         14.9   79      1
## 304 31.8         4         85.0         65   2020         19.2   79      3
## 305 37.3         4         91.0         69   2130         14.7   79      2
## 306 28.4         4        151.0         90   2670         16.0   79      1
## 307 28.8         6        173.0        115   2595         11.3   79      1
## 308 26.8         6        173.0        115   2700         12.9   79      1
## 309 33.5         4        151.0         90   2556         13.2   79      1
## 310 41.5         4         98.0         76   2144         14.7   80      2
## 311 38.1         4         89.0         60   1968         18.8   80      3
## 312 32.1         4         98.0         70   2120         15.5   80      1
## 313 37.2         4         86.0         65   2019         16.4   80      3
## 314 28.0         4        151.0         90   2678         16.5   80      1
## 315 26.4         4        140.0         88   2870         18.1   80      1
## 316 24.3         4        151.0         90   3003         20.1   80      1
## 317 19.1         6        225.0         90   3381         18.7   80      1
## 318 34.3         4         97.0         78   2188         15.8   80      2
## 319 29.8         4        134.0         90   2711         15.5   80      3
## 320 31.3         4        120.0         75   2542         17.5   80      3
## 321 37.0         4        119.0         92   2434         15.0   80      3
## 322 32.2         4        108.0         75   2265         15.2   80      3
## 323 46.6         4         86.0         65   2110         17.9   80      3
## 324 27.9         4        156.0        105   2800         14.4   80      1
## 325 40.8         4         85.0         65   2110         19.2   80      3
## 326 44.3         4         90.0         48   2085         21.7   80      2
## 327 43.4         4         90.0         48   2335         23.7   80      2
## 328 36.4         5        121.0         67   2950         19.9   80      2
## 329 30.0         4        146.0         67   3250         21.8   80      2
## 330 44.6         4         91.0         67   1850         13.8   80      3
## 332 33.8         4         97.0         67   2145         18.0   80      3
## 333 29.8         4         89.0         62   1845         15.3   80      2
## 334 32.7         6        168.0        132   2910         11.4   80      3
## 335 23.7         3         70.0        100   2420         12.5   80      3
## 336 35.0         4        122.0         88   2500         15.1   80      2
## 338 32.4         4        107.0         72   2290         17.0   80      3
## 339 27.2         4        135.0         84   2490         15.7   81      1
## 340 26.6         4        151.0         84   2635         16.4   81      1
## 341 25.8         4        156.0         92   2620         14.4   81      1
## 342 23.5         6        173.0        110   2725         12.6   81      1
## 343 30.0         4        135.0         84   2385         12.9   81      1
## 344 39.1         4         79.0         58   1755         16.9   81      3
## 345 39.0         4         86.0         64   1875         16.4   81      1
## 346 35.1         4         81.0         60   1760         16.1   81      3
## 347 32.3         4         97.0         67   2065         17.8   81      3
## 348 37.0         4         85.0         65   1975         19.4   81      3
## 349 37.7         4         89.0         62   2050         17.3   81      3
## 350 34.1         4         91.0         68   1985         16.0   81      3
## 351 34.7         4        105.0         63   2215         14.9   81      1
## 352 34.4         4         98.0         65   2045         16.2   81      1
## 353 29.9         4         98.0         65   2380         20.7   81      1
## 354 33.0         4        105.0         74   2190         14.2   81      2
## 356 33.7         4        107.0         75   2210         14.4   81      3
## 357 32.4         4        108.0         75   2350         16.8   81      3
## 358 32.9         4        119.0        100   2615         14.8   81      3
## 359 31.6         4        120.0         74   2635         18.3   81      3
## 360 28.1         4        141.0         80   3230         20.4   81      2
## 361 30.7         6        145.0         76   3160         19.6   81      2
## 362 25.4         6        168.0        116   2900         12.6   81      3
## 363 24.2         6        146.0        120   2930         13.8   81      3
## 364 22.4         6        231.0        110   3415         15.8   81      1
## 365 26.6         8        350.0        105   3725         19.0   81      1
## 366 20.2         6        200.0         88   3060         17.1   81      1
## 367 17.6         6        225.0         85   3465         16.6   81      1
## 368 28.0         4        112.0         88   2605         19.6   82      1
## 369 27.0         4        112.0         88   2640         18.6   82      1
## 370 34.0         4        112.0         88   2395         18.0   82      1
## 371 31.0         4        112.0         85   2575         16.2   82      1
## 372 29.0         4        135.0         84   2525         16.0   82      1
## 373 27.0         4        151.0         90   2735         18.0   82      1
## 374 24.0         4        140.0         92   2865         16.4   82      1
## 375 36.0         4        105.0         74   1980         15.3   82      2
## 376 37.0         4         91.0         68   2025         18.2   82      3
## 377 31.0         4         91.0         68   1970         17.6   82      3
## 378 38.0         4        105.0         63   2125         14.7   82      1
## 379 36.0         4         98.0         70   2125         17.3   82      1
## 380 36.0         4        120.0         88   2160         14.5   82      3
## 381 36.0         4        107.0         75   2205         14.5   82      3
## 382 34.0         4        108.0         70   2245         16.9   82      3
## 383 38.0         4         91.0         67   1965         15.0   82      3
## 384 32.0         4         91.0         67   1965         15.7   82      3
## 385 38.0         4         91.0         67   1995         16.2   82      3
## 386 25.0         6        181.0        110   2945         16.4   82      1
## 387 38.0         6        262.0         85   3015         17.0   82      1
## 388 26.0         4        156.0         92   2585         14.5   82      1
## 389 22.0         6        232.0        112   2835         14.7   82      1
## 390 32.0         4        144.0         96   2665         13.9   82      3
## 391 36.0         4        135.0         84   2370         13.0   82      1
## 392 27.0         4        151.0         90   2950         17.3   82      1
## 393 27.0         4        140.0         86   2790         15.6   82      1
## 394 44.0         4         97.0         52   2130         24.6   82      2
## 395 32.0         4        135.0         84   2295         11.6   82      1
## 396 28.0         4        120.0         79   2625         18.6   82      1
## 397 31.0         4        119.0         82   2720         19.4   82      1
##                                     name
## 1              chevrolet chevelle malibu
## 2                      buick skylark 320
## 3                     plymouth satellite
## 4                          amc rebel sst
## 5                            ford torino
## 6                       ford galaxie 500
## 7                       chevrolet impala
## 8                      plymouth fury iii
## 9                       pontiac catalina
## 10                    amc ambassador dpl
## 11                   dodge challenger se
## 12                    plymouth 'cuda 340
## 13                 chevrolet monte carlo
## 14               buick estate wagon (sw)
## 15                 toyota corona mark ii
## 16                       plymouth duster
## 17                            amc hornet
## 18                         ford maverick
## 19                          datsun pl510
## 20          volkswagen 1131 deluxe sedan
## 21                           peugeot 504
## 22                           audi 100 ls
## 23                              saab 99e
## 24                              bmw 2002
## 25                           amc gremlin
## 26                             ford f250
## 27                             chevy c20
## 28                            dodge d200
## 29                              hi 1200d
## 30                          datsun pl510
## 31                   chevrolet vega 2300
## 32                         toyota corona
## 34                           amc gremlin
## 35             plymouth satellite custom
## 36             chevrolet chevelle malibu
## 37                       ford torino 500
## 38                           amc matador
## 39                      chevrolet impala
## 40             pontiac catalina brougham
## 41                      ford galaxie 500
## 42                     plymouth fury iii
## 43                     dodge monaco (sw)
## 44              ford country squire (sw)
## 45                   pontiac safari (sw)
## 46            amc hornet sportabout (sw)
## 47                   chevrolet vega (sw)
## 48                      pontiac firebird
## 49                          ford mustang
## 50                    mercury capri 2000
## 51                             opel 1900
## 52                           peugeot 304
## 53                             fiat 124b
## 54                   toyota corolla 1200
## 55                           datsun 1200
## 56                  volkswagen model 111
## 57                      plymouth cricket
## 58                 toyota corona hardtop
## 59                    dodge colt hardtop
## 60                     volkswagen type 3
## 61                        chevrolet vega
## 62                   ford pinto runabout
## 63                      chevrolet impala
## 64                      pontiac catalina
## 65                     plymouth fury iii
## 66                      ford galaxie 500
## 67                    amc ambassador sst
## 68                       mercury marquis
## 69                  buick lesabre custom
## 70            oldsmobile delta 88 royale
## 71                chrysler newport royal
## 72                       mazda rx2 coupe
## 73                      amc matador (sw)
## 74      chevrolet chevelle concours (sw)
## 75                 ford gran torino (sw)
## 76        plymouth satellite custom (sw)
## 77                       volvo 145e (sw)
## 78                   volkswagen 411 (sw)
## 79                      peugeot 504 (sw)
## 80                       renault 12 (sw)
## 81                       ford pinto (sw)
## 82                       datsun 510 (sw)
## 83           toyouta corona mark ii (sw)
## 84                       dodge colt (sw)
## 85              toyota corolla 1600 (sw)
## 86                     buick century 350
## 87                           amc matador
## 88                      chevrolet malibu
## 89                      ford gran torino
## 90                  dodge coronet custom
## 91              mercury marquis brougham
## 92             chevrolet caprice classic
## 93                              ford ltd
## 94              plymouth fury gran sedan
## 95          chrysler new yorker brougham
## 96              buick electra 225 custom
## 97               amc ambassador brougham
## 98                      plymouth valiant
## 99                 chevrolet nova custom
## 100                           amc hornet
## 101                        ford maverick
## 102                      plymouth duster
## 103              volkswagen super beetle
## 104                     chevrolet impala
## 105                         ford country
## 106               plymouth custom suburb
## 107             oldsmobile vista cruiser
## 108                          amc gremlin
## 109                        toyota carina
## 110                       chevrolet vega
## 111                           datsun 610
## 112                            maxda rx3
## 113                           ford pinto
## 114                     mercury capri v6
## 115                 fiat 124 sport coupe
## 116              chevrolet monte carlo s
## 117                   pontiac grand prix
## 118                             fiat 128
## 119                           opel manta
## 120                           audi 100ls
## 121                          volvo 144ea
## 122                    dodge dart custom
## 123                            saab 99le
## 124                       toyota mark ii
## 125                     oldsmobile omega
## 126                      plymouth duster
## 128                           amc hornet
## 129                       chevrolet nova
## 130                          datsun b210
## 131                           ford pinto
## 132                  toyota corolla 1200
## 133                       chevrolet vega
## 134    chevrolet chevelle malibu classic
## 135                          amc matador
## 136           plymouth satellite sebring
## 137                     ford gran torino
## 138             buick century luxus (sw)
## 139            dodge coronet custom (sw)
## 140                ford gran torino (sw)
## 141                     amc matador (sw)
## 142                             audi fox
## 143                    volkswagen dasher
## 144                           opel manta
## 145                        toyota corona
## 146                           datsun 710
## 147                           dodge colt
## 148                             fiat 128
## 149                          fiat 124 tc
## 150                          honda civic
## 151                               subaru
## 152                            fiat x1.9
## 153              plymouth valiant custom
## 154                       chevrolet nova
## 155                      mercury monarch
## 156                        ford maverick
## 157                     pontiac catalina
## 158                    chevrolet bel air
## 159                  plymouth grand fury
## 160                             ford ltd
## 161                        buick century
## 162            chevroelt chevelle malibu
## 163                          amc matador
## 164                        plymouth fury
## 165                        buick skyhawk
## 166                  chevrolet monza 2+2
## 167                      ford mustang ii
## 168                       toyota corolla
## 169                           ford pinto
## 170                          amc gremlin
## 171                        pontiac astro
## 172                        toyota corona
## 173                    volkswagen dasher
## 174                           datsun 710
## 175                           ford pinto
## 176                    volkswagen rabbit
## 177                            amc pacer
## 178                           audi 100ls
## 179                          peugeot 504
## 180                          volvo 244dl
## 181                            saab 99le
## 182                     honda civic cvcc
## 183                             fiat 131
## 184                            opel 1900
## 185                             capri ii
## 186                           dodge colt
## 187                         renault 12tl
## 188    chevrolet chevelle malibu classic
## 189               dodge coronet brougham
## 190                          amc matador
## 191                     ford gran torino
## 192                     plymouth valiant
## 193                       chevrolet nova
## 194                        ford maverick
## 195                           amc hornet
## 196                   chevrolet chevette
## 197                      chevrolet woody
## 198                            vw rabbit
## 199                          honda civic
## 200                       dodge aspen se
## 201                    ford granada ghia
## 202                   pontiac ventura sj
## 203                        amc pacer d/l
## 204                    volkswagen rabbit
## 205                         datsun b-210
## 206                       toyota corolla
## 207                           ford pinto
## 208                            volvo 245
## 209           plymouth volare premier v8
## 210                          peugeot 504
## 211                       toyota mark ii
## 212                   mercedes-benz 280s
## 213                     cadillac seville
## 214                            chevy c10
## 215                            ford f108
## 216                           dodge d100
## 217                    honda accord cvcc
## 218              buick opel isuzu deluxe
## 219                        renault 5 gtl
## 220                    plymouth arrow gs
## 221                datsun f-10 hatchback
## 222            chevrolet caprice classic
## 223           oldsmobile cutlass supreme
## 224                dodge monaco brougham
## 225              mercury cougar brougham
## 226                   chevrolet concours
## 227                        buick skylark
## 228               plymouth volare custom
## 229                         ford granada
## 230                pontiac grand prix lj
## 231         chevrolet monte carlo landau
## 232                     chrysler cordoba
## 233                     ford thunderbird
## 234             volkswagen rabbit custom
## 235                pontiac sunbird coupe
## 236              toyota corolla liftback
## 237                  ford mustang ii 2+2
## 238                   chevrolet chevette
## 239                       dodge colt m/m
## 240                            subaru dl
## 241                    volkswagen dasher
## 242                           datsun 810
## 243                             bmw 320i
## 244                           mazda rx-4
## 245      volkswagen rabbit custom diesel
## 246                          ford fiesta
## 247                     mazda glc deluxe
## 248                       datsun b210 gx
## 249                     honda civic cvcc
## 250    oldsmobile cutlass salon brougham
## 251                       dodge diplomat
## 252                 mercury monarch ghia
## 253                   pontiac phoenix lj
## 254                     chevrolet malibu
## 255                 ford fairmont (auto)
## 256                  ford fairmont (man)
## 257                      plymouth volare
## 258                          amc concord
## 259                buick century special
## 260                       mercury zephyr
## 261                          dodge aspen
## 262                      amc concord d/l
## 263         chevrolet monte carlo landau
## 264      buick regal sport coupe (turbo)
## 265                          ford futura
## 266                      dodge magnum xe
## 267                   chevrolet chevette
## 268                        toyota corona
## 269                           datsun 510
## 270                           dodge omni
## 271            toyota celica gt liftback
## 272                     plymouth sapporo
## 273               oldsmobile starfire sx
## 274                        datsun 200-sx
## 275                            audi 5000
## 276                          volvo 264gl
## 277                           saab 99gle
## 278                        peugeot 604sl
## 279                  volkswagen scirocco
## 280                      honda accord lx
## 281                    pontiac lemans v6
## 282                     mercury zephyr 6
## 283                      ford fairmont 4
## 284                     amc concord dl 6
## 285                        dodge aspen 6
## 286            chevrolet caprice classic
## 287                      ford ltd landau
## 288                mercury grand marquis
## 289                      dodge st. regis
## 290              buick estate wagon (sw)
## 291             ford country squire (sw)
## 292        chevrolet malibu classic (sw)
## 293 chrysler lebaron town @ country (sw)
## 294                     vw rabbit custom
## 295                     maxda glc deluxe
## 296          dodge colt hatchback custom
## 297                        amc spirit dl
## 298                   mercedes benz 300d
## 299                    cadillac eldorado
## 300                          peugeot 504
## 301    oldsmobile cutlass salon brougham
## 302                     plymouth horizon
## 303                 plymouth horizon tc3
## 304                           datsun 210
## 305                   fiat strada custom
## 306                buick skylark limited
## 307                   chevrolet citation
## 308            oldsmobile omega brougham
## 309                      pontiac phoenix
## 310                            vw rabbit
## 311                toyota corolla tercel
## 312                   chevrolet chevette
## 313                           datsun 310
## 314                   chevrolet citation
## 315                        ford fairmont
## 316                          amc concord
## 317                          dodge aspen
## 318                            audi 4000
## 319               toyota corona liftback
## 320                            mazda 626
## 321                 datsun 510 hatchback
## 322                       toyota corolla
## 323                            mazda glc
## 324                           dodge colt
## 325                           datsun 210
## 326                 vw rabbit c (diesel)
## 327                   vw dasher (diesel)
## 328                  audi 5000s (diesel)
## 329                   mercedes-benz 240d
## 330                  honda civic 1500 gl
## 332                            subaru dl
## 333                     vokswagen rabbit
## 334                        datsun 280-zx
## 335                        mazda rx-7 gs
## 336                    triumph tr7 coupe
## 338                         honda accord
## 339                     plymouth reliant
## 340                        buick skylark
## 341               dodge aries wagon (sw)
## 342                   chevrolet citation
## 343                     plymouth reliant
## 344                       toyota starlet
## 345                       plymouth champ
## 346                     honda civic 1300
## 347                               subaru
## 348                       datsun 210 mpg
## 349                        toyota tercel
## 350                          mazda glc 4
## 351                   plymouth horizon 4
## 352                       ford escort 4w
## 353                       ford escort 2h
## 354                     volkswagen jetta
## 356                        honda prelude
## 357                       toyota corolla
## 358                         datsun 200sx
## 359                            mazda 626
## 360            peugeot 505s turbo diesel
## 361                         volvo diesel
## 362                      toyota cressida
## 363                    datsun 810 maxima
## 364                        buick century
## 365                oldsmobile cutlass ls
## 366                      ford granada gl
## 367               chrysler lebaron salon
## 368                   chevrolet cavalier
## 369             chevrolet cavalier wagon
## 370            chevrolet cavalier 2-door
## 371           pontiac j2000 se hatchback
## 372                       dodge aries se
## 373                      pontiac phoenix
## 374                 ford fairmont futura
## 375                  volkswagen rabbit l
## 376                   mazda glc custom l
## 377                     mazda glc custom
## 378               plymouth horizon miser
## 379                       mercury lynx l
## 380                     nissan stanza xe
## 381                         honda accord
## 382                       toyota corolla
## 383                          honda civic
## 384                   honda civic (auto)
## 385                        datsun 310 gx
## 386                buick century limited
## 387    oldsmobile cutlass ciera (diesel)
## 388           chrysler lebaron medallion
## 389                       ford granada l
## 390                     toyota celica gt
## 391                    dodge charger 2.2
## 392                     chevrolet camaro
## 393                      ford mustang gl
## 394                            vw pickup
## 395                        dodge rampage
## 396                          ford ranger
## 397                           chevy s-10
  1. Explore the data, list the variables with clear definitions. Set each variable with its appropriate class. For example origin should be set as a factor.

Explaining the Variables:

  • mpg: Miles per gallon, a measure of fuel economy: how many miles the car can travel on one gallon of gas.
  • cylinders: The number of cylinders in the car’s engine, ranging from 3 to 8.
  • displacement: The engine displacement in cubic inches.
  • horsepower: The power of the engine in horsepower.
  • weight: The weight of the vehicle in lbs.
  • acceleration: Time to accelerate from 0 to 60 mph in seconds.
  • year: The model year of the vehicle.
  • origin: The car’s country of origin.
  • name: Brand and model name of the vehicle.

All variables except origin and name are numeric. origin is a categorical factor that represents the country of manufacture.
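
As a minimal sketch of the class-setting step, the snippet below converts a numerically coded origin column into a labelled factor. The three toy rows are hypothetical, and the coding 1 = USA, 2 = Europe, 3 = Japan is an assumption about how the raw file stores origin; only the level names come from the grouped summary in this report.

```r
# Hypothetical toy rows; the real data set has 392 cars.
toy <- data.frame(
  mpg = c(18, 26, 31),
  cylinders = c(8, 4, 4),
  origin = c(1, 2, 3),               # assumed numeric coding of origin
  name = c("car a", "car b", "car c"),
  stringsAsFactors = FALSE
)

# Recode origin as a factor with readable labels (assumed mapping).
toy$origin <- factor(toy$origin,
                     levels = c(1, 2, 3),
                     labels = c("USA", "Europe", "Japan"))

# Inspect the resulting classes.
str(toy)
```

With origin stored as a factor, lm() creates one indicator per non-reference level instead of treating the codes as a numeric scale.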

  1. How many cars are included in this data set?
## [1] 392

392 cars are included in this data set.

  1. EDA, focus on pairwise plots and summary statistics. Briefly summarize your findings and any peculiarities in the data.
##   mean_mpg sd_mpg min_mpg max_mpg
## 1     23.4   7.81       9    46.6

The average mpg is 23.4. The least fuel-efficient car achieves only 9 mpg, while the most fuel-efficient car achieves 46.6 mpg. There is moderate variation in mpg, with a standard deviation of 7.81 mpg.

## # A tibble: 3 × 3
##   origin mean_mpg sd_mpg
##   <fct>     <dbl>  <dbl>
## 1 USA        20.0   6.44
## 2 Europe     27.6   6.58
## 3 Japan      30.5   6.09

Grouping the vehicles by origin, the means and standard deviations of fuel economy reveal that Japanese cars have the highest average mpg at 30.5, followed by European (27.6) and then American (20.0) vehicles.

Fuel economy was also slightly more consistent among Japanese cars, with a standard deviation of 6.09 mpg, compared to 6.58 and 6.44 mpg for European and American cars respectively, although the three spreads are similar.

Relationship between Displacement and MPG

The scatterplot above shows that as the engine’s displacement increases, the MPG decreases.

Relationship between Weight and MPG

Relationship between Acceleration and MPG

3.2 What effect does time have on MPG?

  1. Start with a simple regression of mpg vs. year and report R’s summary output. Is year a significant variable at the .05 level? State what effect year has on mpg, if any, according to this model.

##                2.5 %  97.5 %
## (Intercept) -2746.46 -2067.7
## year            1.06     1.4
## 
## Call:
## lm(formula = mpg ~ year, data = auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -12.021  -5.441  -0.441   4.974  18.209 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2.41e+03   1.73e+02   -13.9   <2e-16 ***
## year         1.23e+00   8.74e-02    14.1   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.36 on 390 degrees of freedom
## Multiple R-squared:  0.337,  Adjusted R-squared:  0.335 
## F-statistic:  198 on 1 and 390 DF,  p-value: <2e-16

The p-value for year is <2e-16, so year is a significant variable at the .05 level. According to this model, each additional model year is associated with an average increase of 1.23 mpg.

  1. Add horsepower on top of the variable year to your linear model. Is year still a significant variable at the .05 level? Give a precise interpretation of the year’s effect found here.

As the horsepower output of the vehicle increases, the mpg decreases.

##                 2.5 %    97.5 %
## (Intercept) -1519.516 -1003.580
## year            0.527     0.788
## horsepower     -0.144    -0.119
## 
## Call:
## lm(formula = mpg ~ year + horsepower, data = auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -12.077  -3.078  -0.431   2.588  15.315 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.26e+03   1.31e+02   -9.61   <2e-16 ***
## year         6.57e-01   6.63e-02    9.92   <2e-16 ***
## horsepower  -1.32e-01   6.34e-03  -20.76   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.39 on 389 degrees of freedom
## Multiple R-squared:  0.685,  Adjusted R-squared:  0.684 
## F-statistic:  424 on 2 and 389 DF,  p-value: <2e-16

In the multiple linear regression model, holding horsepower constant, a one-year increase in model year is associated with an average increase of approximately 0.657 mpg. Compared to the simple regression, where a one-year increase was associated with an average increase of 1.23 mpg, the year effect almost halves after controlling for horsepower.

Holding year constant, a one-unit increase in horsepower is associated with a decrease of about 0.132 mpg on average. The p-values for both year and horsepower are < 2e-16, meaning both are statistically significant at the 0.05 level.

The residual standard error suggests that the typical deviation of observed mpg values from the fitted regression surface is about 4.39 mpg.

  1. The two 95% CI’s for the coefficient of year differ among (a) and (b). How would you explain the difference to a non-statistician?

For the simple regression, the 95% CI was \[(1.06, 1.40)\]. This means that each additional year is associated with an average increase of between 1.06 and 1.40 mpg when year is the only predictor.

For the multiple linear regression, the 95% CI was \[(0.527, 0.788)\], which means that each additional year is associated with an increase of only about 0.53 to 0.79 mpg (point estimate 0.66) after adjusting for horsepower.

## [1] -0.416

Obtaining the correlation between year and horsepower, we see there is a negative correlation. This means that as year increases, horsepower tends to decrease on average. Historically, this is true, as vehicles from the 1970s had bigger engines with higher horsepower. However, after the energy crisis in the 1970s and 1980s, the sizes of engines got much smaller, which in turn decreased the horsepower output of these vehicles.

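This explains why, when we hold horsepower constant, fuel economy improves less with each additional year and the confidence interval is centered on a lower value: part of the year effect in the simple regression was really the effect of newer cars having less horsepower.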
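
The link between the two year coefficients can be written out with the standard least squares identity for an omitted predictor. Here \(\hat\delta\) denotes the slope from regressing horsepower on year; it is not reported in the output above, so the value below is only what the rounded coefficients imply.

\[\hat\beta_{year}^{\,simple} = \hat\beta_{year}^{\,multiple} + \hat\beta_{horsepower}\times\hat\delta\]

\[1.23 \approx 0.657 + (-0.132)\times\hat\delta \quad\Longrightarrow\quad \hat\delta \approx \frac{1.23-0.657}{-0.132} \approx -4.3\]

In words, under these rounded figures horsepower falls by roughly 4.3 units per model year on average, and that decline accounts for about 0.57 of the 1.23 mpg per year seen in the simple regression.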

  1. Create a model with interaction by fitting lm(mpg ~ year * horsepower). Is the interaction effect significant at .05 level? Explain the year effect (if any).
## 
## Call:
## lm(formula = mpg ~ year * horsepower, data = auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -12.349  -2.451  -0.456   2.406  14.444 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     -4.29e+03   3.19e+02   -13.5   <2e-16 ***
## year             2.19e+00   1.61e-01    13.6   <2e-16 ***
## horsepower       3.14e+01   3.08e+00    10.2   <2e-16 ***
## year:horsepower -1.60e-02   1.56e-03   -10.2   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.9 on 388 degrees of freedom
## Multiple R-squared:  0.752,  Adjusted R-squared:  0.75 
## F-statistic:  393 on 3 and 388 DF,  p-value: <2e-16

The interaction between year and horsepower is statistically significant at the 0.05 level, as its p-value is <2e-16. The fitted model is effectively \[ mpg = \beta_0+\beta_1\times year+\beta_2\times horsepower+\beta_3(year\times horsepower)\] From this we can conclude that the effect of year now also depends on horsepower.

  • For low horsepower cars → year has a larger positive effect.
  • For high horsepower cars → year has a smaller effect, which can even turn negative at very high horsepower.

The negative interaction coefficient of -0.016 for year:horsepower means the year improvement in mpg shrinks as horsepower increases.
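
To make the dependence concrete, the year effect at a given horsepower can be read off the fitted coefficients. The two horsepower values below are illustrative choices, and the arithmetic uses the rounded estimates from the summary, so the results are approximate.

\[\frac{\partial\,\widehat{mpg}}{\partial\, year} = \hat\beta_1+\hat\beta_3\times horsepower \approx 2.19 - 0.016\times horsepower\]

\[horsepower = 75:\quad 2.19 - 0.016\times 75 = 0.99 \qquad\qquad horsepower = 150:\quad 2.19 - 0.016\times 150 = -0.21\]

The fitted year effect crosses zero near \(2.19/0.016\approx 137\) horsepower, so for cars above that level this model estimates no improvement in mpg with model year.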

3.3 Categorical predictors

Remember that the same variable can play different roles! Take a quick look at the variable cylinders, and try to use this variable in the following analyses wisely. We all agree that a larger number of cylinders will lower mpg. However, we can interpret cylinders as either a continuous (numeric) variable or a categorical variable.

  1. Fit a model that treats cylinders as a continuous/numeric variable. Is cylinders significant at the 0.01 level? What effect does cylinders play in this model?
## 
## Call:
## lm(formula = mpg ~ cylinders, data = auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -14.241  -3.183  -0.633   2.549  17.917 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   42.916      0.835    51.4   <2e-16 ***
## cylinders     -3.558      0.146   -24.4   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.91 on 390 degrees of freedom
## Multiple R-squared:  0.605,  Adjusted R-squared:  0.604 
## F-statistic:  597 on 1 and 390 DF,  p-value: <2e-16
##             2.5 % 97.5 %
## (Intercept) 41.27  44.56
## cylinders   -3.84  -3.27
## [1] TRUE
## cylinders 
##     -3.56

The p-value for cylinders is <2e-16, hence cylinders is statistically significant at the 0.01 level.

The slope estimate is -3.558, indicating that for each additional cylinder, the expected mpg decreases by approximately 3.558 on average.

The 95% confidence interval for the slope is \[(-3.84, -3.27)\].

The entire interval is negative, confirming a strong negative relationship between cylinders and mpg, further highlighted in the plot below.

## `geom_smooth()` using formula = 'y ~ x'

  1. Fit a model that treats cylinders as a categorical/factor. Is cylinders significant at the .01 level? What is the effect of cylinders in this model? Describe the cylinders effect over mpg.
## 
## Call:
## lm(formula = mpg ~ cylinders_f, data = auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -11.284  -2.904  -0.963   2.344  18.027 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    20.550      2.349    8.75  < 2e-16 ***
## cylinders_f4    8.734      2.373    3.68  0.00027 ***
## cylinders_f5    6.817      3.589    1.90  0.05825 .  
## cylinders_f6   -0.577      2.405   -0.24  0.81071    
## cylinders_f8   -5.587      2.395   -2.33  0.02015 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.7 on 387 degrees of freedom
## Multiple R-squared:  0.641,  Adjusted R-squared:  0.638 
## F-statistic:  173 on 4 and 387 DF,  p-value: <2e-16
##                2.5 % 97.5 %
## (Intercept)   15.931 25.169
## cylinders_f4   4.069 13.399
## cylinders_f5  -0.239 13.873
## cylinders_f6  -5.306  4.153
## cylinders_f8 -10.295 -0.879
## Anova Table (Type II tests)
## 
## Response: mpg
##             Sum Sq  Df F value Pr(>F)    
## cylinders_f  15275   4     173 <2e-16 ***
## Residuals     8544 387                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## # A tibble: 5 × 4
##   cylinders_f     n mean_mpg sd_mpg
##   <fct>       <int>    <dbl>  <dbl>
## 1 3               4     20.6   2.56
## 2 4             199     29.3   5.67
## 3 5               3     27.4   8.23
## 4 6              83     20.0   3.83
## 5 8             103     15.0   2.84

The overall F-test for cylinders as a categorical variable (Anova table above) still has a p-value <2e-16, so cylinders is significant at the 0.01 level.

Looking at the table of means and standard deviations for the different cylinder counts, we can see that 4 cylinder vehicles have the highest mean mpg and that fuel economy decreases as the number of cylinders increases. The mean mpg for 3 cylinder vehicles is fairly low, which goes against the trend identified in part a). However, there are only 4 observations with 3 cylinders, so this estimate is unstable and may not be representative of the true fuel economy of these vehicles. The same is true for 5 cylinder vehicles, for which there are only 3 observations.

In the box plot we can see that there are many outliers among 6 cylinder vehicles, with one vehicle achieving greater mpg than the most fuel-efficient 5 cylinder vehicle. 4 cylinder vehicles have the greatest range in mpg values, but they also have the most observations.

  1. What are the fundamental differences between treating cylinders as a continuous and categorical variable in your models?

The numerical model assumes that \[ E[mpg|cylinders]=\beta_0+\beta_1\times cylinders \]

This forces a linear trend and imposes a constant change in mpg per each additional cylinder added.

Meanwhile, the factor model assumes that \[E[mpg|cylinders=k]=\mu_k\]

This does not assume linearity, allowing each cylinder category to have its own mean.

  1. Can you test the null hypothesis: fit0: mpg is linear in cylinders vs. fit1: mpg relates to cylinders as a categorical variable at .01 level?
## Analysis of Variance Table
## 
## Model 1: mpg ~ cylinders
## Model 2: mpg ~ cylinders_f
##   Res.Df  RSS Df Sum of Sq    F  Pr(>F)    
## 1    390 9416                              
## 2    387 8544  3       871 13.2 3.4e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

anova was used to compare the linear and factor models. From the ANOVA table, we see that \[F=13.2,\qquad p=3.4\times 10^{-8}\]

Since \[3.4\times 10^{-8}<0.01\], we reject at the 0.01 level the null hypothesis that mpg is linear in cylinders.

Hence, there is strong evidence that the relationship between mpg and cylinders is not purely linear, and the categorical model provides a significantly better fit than the linear model. Treating cylinders as a factor variable is therefore more appropriate here.
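
The comparison above works because the linear model is nested inside the factor model, so anova() can run a partial F-test. The sketch below reproduces the mechanics on simulated data; the group means, sample size, and seed are made up for illustration and are not the Auto data.

```r
set.seed(42)

# Simulated cylinder counts and group means (illustrative values only).
cyl <- sample(c(3, 4, 5, 6, 8), size = 200, replace = TRUE)
group_means <- c("3" = 20, "4" = 29, "5" = 27, "6" = 20, "8" = 15)
mpg <- unname(group_means[as.character(cyl)]) + rnorm(200, sd = 4)

toy <- data.frame(mpg = mpg, cylinders = cyl, cylinders_f = factor(cyl))

fit0 <- lm(mpg ~ cylinders, data = toy)    # H0: mpg is linear in cylinders
fit1 <- lm(mpg ~ cylinders_f, data = toy)  # H1: one free mean per cylinder count

# Partial F-test: extra degrees of freedom = (levels - 1) - 1.
cmp <- anova(fit0, fit1)
print(cmp)
```

With five cylinder levels the factor model uses three more parameters than the straight line, which matches the Df of 3 in the table above.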

3.4 Results

Final modeling question: we want to explore the effects of each feature as best as possible. You may explore interactions, feature transformations, higher order terms, or other strategies within reason. The model(s) should be as parsimonious (simple) as possible unless the gain in accuracy is significant from your point of view.

  1. Describe the final model. Include diagnostic plots with particular focus on the model residuals and diagnoses.
##              displacement weight horsepower
## displacement        1.000  0.933      0.897
## weight              0.933  1.000      0.865
## horsepower          0.897  0.865      1.000

We can see displacement is highly correlated with both weight and horsepower, meaning keeping all three variables would not be necessary and would increase standard errors and reduce interpretability. Hence, displacement will be excluded from the final model.

The final model is \[mpg \sim year+weight+horsepower+cylinders+origin\], with cylinders and origin entered as factors.

The line shown in the residuals vs fitted plot shows a slight curvature, suggesting minor non-linearity. There is also a small increase in the spread of the residuals at higher fitted values, indicating slight heteroskedasticity. However, in the Q-Q plot the vast majority of the points lie along the reference line, suggesting that the normality assumption is reasonable and that the model provides an appropriate fit to the data.

  1. Summarize the effects found.
## 
## Call:
## lm(formula = mpg ~ year + weight + horsepower + cylinders_f + 
##     origin, data = auto)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -8.442 -1.953 -0.063  1.563 12.786 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -1.39e+03   9.64e+01  -14.46  < 2e-16 ***
## year          7.22e-01   4.88e-02   14.81  < 2e-16 ***
## weight       -5.10e-03   5.06e-04  -10.07  < 2e-16 ***
## horsepower   -2.54e-02   9.79e-03   -2.60  0.00977 ** 
## cylinders_f4  7.67e+00   1.61e+00    4.76  2.7e-06 ***
## cylinders_f5  8.39e+00   2.47e+00    3.39  0.00077 ***
## cylinders_f6  5.24e+00   1.68e+00    3.12  0.00192 ** 
## cylinders_f8  8.03e+00   1.79e+00    4.49  9.6e-06 ***
## originEurope  1.28e+00   5.22e-01    2.45  0.01455 *  
## originJapan   2.21e+00   5.06e-01    4.38  1.6e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.12 on 382 degrees of freedom
## Multiple R-squared:  0.844,  Adjusted R-squared:  0.841 
## F-statistic:  230 on 9 and 382 DF,  p-value: <2e-16

Year and weight have p-values <2e-16. Holding the other predictors constant, each additional model year is associated with a 0.722 increase in mpg on average, while each additional pound of weight is associated with a decrease of 0.0051 mpg on average. Each additional horsepower reduces mpg by about 0.0254; its p-value of 0.00977 is just below the 0.01 threshold, so the evidence for horsepower is weaker than for year and weight.

Treating cylinders as a categorical variable confirms that the number of cylinders affects mpg in a non-linear manner. Additionally, vehicles from Europe and Japan exhibit higher mpg (by 1.28 and 2.21 mpg respectively) relative to U.S. vehicles after controlling for the other characteristics in the model.

Overall, fuel efficiency is strongly influenced by vehicle size, engine output, year, and country of origin.

  1. Predict the mpg of the following car: A red car built in the US in 1983 that is 180 inches long, has eight cylinders, displaces 350 cu. inches, weighs 4000 pounds, and has a horsepower of 260. Also give a 95% CI for your prediction.

Since colour, length and displacement are not included in the model, they are not used in prediction.

##    fit  lwr  upr
## 1 19.5 12.9 26.1

Using the model described above, the predicted mpg of the vehicle is 19.5 mpg. The 95% confidence interval is \[(12.9, 26.1)\].
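
The fit/lwr/upr output does not show which interval type was requested, so the sketch below illustrates the two options that predict() offers for an lm fit. The data are simulated with made-up coefficients, and the model is a reduced stand-in (weight and year only), not the final model of this report.

```r
set.seed(7)
n <- 120

# Simulated stand-in data (hypothetical coefficients and noise level).
toy <- data.frame(
  weight = runif(n, 1800, 5000),
  year = sample(1970:1982, n, replace = TRUE)
)
toy$mpg <- 40 - 0.006 * toy$weight + 0.7 * (toy$year - 1970) + rnorm(n, sd = 3)

fit <- lm(mpg ~ weight + year, data = toy)
new_car <- data.frame(weight = 4000, year = 1983)

# Interval for the mean mpg of all cars with these characteristics.
ci_mean <- predict(fit, newdata = new_car, interval = "confidence", level = 0.95)

# Interval for the mpg of one new car; it also includes the error variance.
pi_new <- predict(fit, newdata = new_car, interval = "prediction", level = 0.95)

print(ci_mean)
print(pi_new)
```

For a single car such as the one in the question, the prediction interval is usually the relevant one, and it is always wider than the confidence interval for the mean at the same point.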

4 Simple Regression through simulations

4.1 Linear model through simulations

This exercise is designed to help you understand the linear model using simulations. In this exercise, we will generate \((x_i, y_i)\) pairs so that all linear model assumptions are met.

Presume that \(\mathbf{x}\) and \(\mathbf{y}\) are linearly related with a normal error \(\boldsymbol{\varepsilon}\) , such that \(\mathbf{y} = 1 + 1.2\mathbf{x} + \boldsymbol{\varepsilon}\). The standard deviation of the error \(\varepsilon_i\) is \(\sigma = 2\).

4.1.1 Generate data

Create a corresponding output vector for \(\mathbf{y}\) according to the equation given above. Use set.seed(1). Then, create a scatterplot with \((x_i, y_i)\) pairs. Base R plotting is acceptable, but if you can, please attempt to use ggplot2 to create the plot. Make sure to have clear labels and sensible titles on your plots.
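
A minimal sketch of the data-generating step follows. The text above does not show how \(\mathbf{x}\) is built; the code assumes 40 equally spaced points on [0, 1], which is consistent with the theoretical slope SD of about 1.07 reported in Section 4.2. Base R plotting is used here only to keep the snippet free of package dependencies.

```r
set.seed(1)

n <- 40
x <- seq(0, 1, length.out = n)   # assumed design: equally spaced on [0, 1]

# True model: y = 1 + 1.2 x + error, with error SD sigma = 2.
y <- 1 + 1.2 * x + rnorm(n, mean = 0, sd = 2)
df <- data.frame(x = x, y = y)

# Scatterplot with the true mean function overlaid.
plot(df$x, df$y,
     xlab = "x", ylab = "y",
     main = "Simulated data: y = 1 + 1.2x + error")
abline(a = 1, b = 1.2, lty = 2)

# Theoretical SD of the LS slope under this design: sigma / sqrt(Sxx).
sxx <- sum((x - mean(x))^2)
theory_sd <- 2 / sqrt(sxx)
```

Under the assumed design, theory_sd rounds to 1.07.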

4.1.2 Understand the model

  1. Find the LS estimates of \(\boldsymbol{\beta}_0\) and \(\boldsymbol{\beta}_1\), using the lm() function. What are the true values of \(\boldsymbol{\beta}_0\) and \(\boldsymbol{\beta}_1\)? Do the estimates look to be good?
## 
## Call:
## lm(formula = y ~ x, data = df)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -4.662 -0.880  0.014  1.247  2.882 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept)    1.331      0.557    2.39    0.022 *
## x              0.906      0.959    0.95    0.350  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.79 on 38 degrees of freedom
## Multiple R-squared:  0.023,  Adjusted R-squared:  -0.00272 
## F-statistic: 0.894 on 1 and 38 DF,  p-value: 0.35
## (Intercept) 
##        1.33
##     x 
## 0.906

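The true values for \(\beta_0\) and \(\beta_1\) are 1 and 1.2 respectively.

The estimated values are \(\hat\beta_0=1.33\) and \(\hat\beta_1=0.906\). Both are within one standard error of the true values (standard errors 0.557 and 0.959), although the slope estimate is imprecise for a sample of this size.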

  1. What is your RSE for this linear model fit? Is it close to \(\sigma = 2\)?
## [1] 1.79

The residual standard error of 1.79 is fairly close to the true value \(\sigma = 2\).

  1. What is the 95% confidence interval for \(\boldsymbol{\beta}_1\)? Does this confidence interval capture the true \(\boldsymbol{\beta}_1\)?
##  2.5 % 97.5 % 
##  -1.03   2.85

We have a 95% confidence interval of \((-1.03, 2.85)\), which includes the true value \(\boldsymbol{\beta}_1=1.2\).

  1. Overlay the LS estimates and the true lines of the mean function onto a copy of the scatterplot you made above.
## `geom_smooth()` using formula = 'y ~ x'

The black line is the true mean function. The blue line is the least squares fitted line. The two lines are fairly close, with minor deviations at lower x values.

4.1.3 Diagnoses

  1. Provide residual plot where fitted \(\mathbf{y}\)-values are on the x-axis and residuals are on the y-axis.

  1. Provide a normal QQ plot of the residuals.

  1. Comment on how well the model assumptions are met for the sample you used.

The residuals vs fitted plot shows no systematic pattern or curvature. The residuals are randomly scattered around zero, with approximately constant spread, suggesting that the linearity and homoscedasticity assumptions are satisfied.

From the Q-Q plot we can see that several points fall below the reference line in the lower tail. However, this is expected for a small sample size of 40. The remaining points lie closely to the reference line.

Linear model assumptions are well satisfied in this sample.

4.2 Understand sampling distribution and confidence intervals

This part aims to help you understand the notion of sampling statistics and confidence intervals. Let’s concentrate on estimating the slope only.

Generate 100 samples of size \(n = 40\), and estimate the slope coefficient from each sample. Also construct 95% confidence intervals for the slope.

  1. Summarize the LS estimates of the slope. Does the sampling distribution agree with theory? (First specify the theoretical sampling distribution of the LS estimate.)
## [1] 1.04
## [1] 1.1
## [1] 1.07

Under a simple linear model \[y_i=\beta_0+\beta_1x_i+\epsilon_i,\qquad \epsilon_i\sim N(0,\sigma^2)\] with fixed \(x_1,\dots,x_n\), the least squares slope satisfies \[\hat\beta_1\sim N\bigg(\beta_1,\frac{\sigma^2}{S_{xx}}\bigg),\quad \text{where}\ \ S_{xx}=\sum_{i=1}^n(x_i-\bar x)^2.\] In this simulation, we have \(\beta_1=1.2\) and \(\sigma=2\), so

\[\hat{\beta}_1 \sim N\!\left(1.2,\; \frac{4}{S_{xx}}\right), \quad \text{and} \quad \mathrm{SD}(\hat{\beta}_1) = \frac{2}{\sqrt{S_{xx}}}.\] The simulated standard deviation of 1.1 is close to the theoretical value of 1.07, so the variability matches the theory well. The simulated mean slope of 1.04 is slightly below the true slope of 1.2, whereas in theory \(E[\hat\beta_1]=\beta_1\). With 100 simulations, the mean of the estimates has a standard error of about \(1.07/\sqrt{100}\approx 0.11\), so a gap of 0.16 is within the range of ordinary simulation noise. Running more simulations would bring the simulated mean closer to the true value.

  1. How many of your 95% confidence intervals capture the true \(\boldsymbol{\beta}_1\)? Display your confidence intervals graphically.

## [1] 0.96

Out of 100 simulated samples, 96 of the 95% confidence intervals contained the true slope \(\beta_1=1.2\). This empirical coverage rate of 96% is very close to the theoretical coverage of 95%.
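
The simulation behind these numbers can be sketched as follows. The design for \(\mathbf{x}\) (40 equally spaced points on [0, 1]) and the seed are assumptions, so the exact mean, SD, and coverage printed by this code need not match the values reported above.

```r
set.seed(1)

n_sim <- 100
n <- 40
x <- seq(0, 1, length.out = n)   # assumed fixed design
beta1 <- 1.2
sigma <- 2

slopes <- numeric(n_sim)
covered <- logical(n_sim)

for (i in seq_len(n_sim)) {
  # Fresh errors each round; x stays fixed.
  y <- 1 + beta1 * x + rnorm(n, sd = sigma)
  fit <- lm(y ~ x)
  slopes[i] <- coef(fit)[["x"]]
  ci <- confint(fit, "x", level = 0.95)
  covered[i] <- ci[1] <= beta1 && beta1 <= ci[2]
}

mean_slope <- mean(slopes)                          # compare with beta1 = 1.2
sd_slope <- sd(slopes)                              # compare with theory_sd
theory_sd <- sigma / sqrt(sum((x - mean(x))^2))
coverage <- mean(covered)                           # compare with 0.95

print(c(mean = mean_slope, sd = sd_slope, theory = theory_sd, coverage = coverage))
```

Because coverage is a proportion out of only 100 intervals, values a few points above or below 95% are expected from run to run.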